Data Science Application Case Study: Practice Report

Group members: XXX

Main methods: pandas for data processing and Pyecharts for plotting

Abstract: This report analyzes the 2020 Summer Olympic Games using Python libraries such as pandas and pyecharts to carry out data cleaning, data mining, and data visualization. The daily gold medal counts and related Olympic data are organized, and predictions are made from the data. The predictions and the related changes are presented as charts, which makes them easier to understand.

Keywords: Olympic Games, python, pandas, pyecharts…

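I. Background

After the 2020 Olympic Games ended, we analyzed the Olympic data and used visualization to show the gold medal standings and how they changed during the Games, so that readers can understand the Games more fully.

II. Data Analysis Workflow

1. Importing modules

If any library is missing, run pip install -r requirements.txt to install it.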

!pip install --upgrade pyecharts
import pandas as pd
from pyecharts.charts import Timeline, Line, Tree
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode

2. Data processing with pandas

2.1 Reading the data

df = pd.read_csv('../others/2020东京奥运会奖牌数据.csv', index_col=0, encoding = 'gb18030')
df.head(20)



日期        国家      国家编码  金牌  银牌  铜牌  总计  国旗
2021-07-24  中国      CHN       3     0     1     4     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-24  意大利    ITA       1     1     0     2     https://www.sinaimg.cn/ty/2020/Olympic/flag/IT...
2021-07-24  日本      JPN       1     1     0     2     https://www.sinaimg.cn/ty/2020/Olympic/flag/JP...
2021-07-24  韩国      KOR       1     0     2     3     https://www.sinaimg.cn/ty/2020/Olympic/flag/KO...
2021-07-24  厄瓜多尔  ECU       1     0     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/EC...
2021-07-24  匈牙利    HUN       1     0     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/HU...
2021-07-24  伊朗      IRI       1     0     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/IR...
2021-07-24  科索沃    KOS       1     0     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/KO...
2021-07-24  泰国      THA       1     0     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/TH...
2021-07-24  ROC       ROC       0     1     1     2     https://www.sinaimg.cn/ty/2020/Olympic/flag/RO...
2021-07-24  塞尔维亚  SRB       0     1     1     2     https://www.sinaimg.cn/ty/2020/Olympic/flag/SR...
2021-07-24  比利时    BEL       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/BE...
2021-07-24  西班牙    ESP       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/ES...
2021-07-24  印度      IND       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/IN...
2021-07-24  荷兰      NED       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/NE...
2021-07-24  罗马尼亚  ROU       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/RO...
2021-07-24  中国台北  TPE       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/TP...
2021-07-24  突尼斯    TUN       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/TU...
2021-07-24  爱沙尼亚  EST       0     0     1     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/ES...
2021-07-24  法国      FRA       0     0     1     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/FR...

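This loads the data from the CSV file into our project. The columns are 国家 (country), 国家编码 (country code), 金牌 (gold), 银牌 (silver), 铜牌 (bronze), 总计 (total), and 国旗 (flag image URL); the index is 日期 (date).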
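To illustrate what index_col=0 and encoding='gb18030' do in the read_csv call above, here is a minimal sketch. The two-row sample text and the names sample_text, raw_bytes and sample_df are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the layout of the medal file.
sample_text = (
    "日期,国家,国家编码,金牌\n"
    "2021-07-24,中国,CHN,3\n"
    "2021-07-24,日本,JPN,1\n"
)

# Encode the text as GB18030, the encoding passed to read_csv in the report.
raw_bytes = sample_text.encode("gb18030")

# index_col=0 turns the first column (日期) into the row index;
# encoding tells pandas how to decode the bytes.
sample_df = pd.read_csv(io.BytesIO(raw_bytes), index_col=0, encoding="gb18030")
print(sample_df)
```

If the encoding argument did not match the file, the Chinese column names would either fail to decode or come out garbled, which is why it is given explicitly.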

2.2 Checking for missing values

df.isnull().any()
国家      False
国家编码    False
金牌      False
银牌      False
铜牌      False
总计      False
国旗      False
dtype: bool

No column contains missing values.
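As a small illustration of this check, the sketch below runs it on a made-up frame that does contain a gap. The frame sample and its values are assumptions for the example only. isnull().any() answers "is anything missing in this column?", while isnull().sum() counts how many values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing gold-medal value.
sample = pd.DataFrame({"金牌": [3, np.nan, 1], "银牌": [0, 1, 2]})

# True for every column that has at least one missing value.
has_missing = sample.isnull().any()

# Number of missing values per column.
missing_counts = sample.isnull().sum()

print(has_missing)
print(missing_counts)
```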

2.3 Viewing China's daily data

df1 = df[df['国家']=='中国']
df1



日期        国家  国家编码  金牌  银牌  铜牌  总计  国旗
2021-07-24  中国  CHN       3     0     1     4     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-25  中国  CHN       3     1     3     7     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-26  中国  CHN       0     4     3     7     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-27  中国  CHN       3     0     0     3     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-28  中国  CHN       3     1     2     6     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-29  中国  CHN       3     1     0     4     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-30  中国  CHN       4     3     2     9     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-07-31  中国  CHN       2     3     0     5     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-01  中国  CHN       3     1     1     5     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-02  中国  CHN       5     3     3     11    https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-03  中国  CHN       3     4     0     7     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-04  中国  CHN       0     1     0     1     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-05  中国  CHN       2     2     0     4     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-06  中国  CHN       2     2     1     5     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...
2021-08-07  中国  CHN       1     1     0     2     https://www.sinaimg.cn/ty/2020/Olympic/flag/CH...

2.4 Aggregating data for four countries: China, the United States, Japan, and Australia

all_country_data = []   # one row per country: [name, cumulative gold count per day, ...]
flg = {}                # country name -> flag image URL
cols = ['国家']          # column labels: '国家' followed by the dates
countrys = ['中国','美国','日本','澳大利亚']
for country in countrys:
    df1 = df[df['国家']==country]
    # Take the dates from the index; this relies on the first country
    # in the list having at least as many rows as the others.
    if len(df1.index.tolist()) >= len(cols):
        cols += df1.index.tolist()
    # The flag URL is the last column (国旗) of the country's first row.
    flg[country] = df1.iloc[:1, -1].values[0]

    # Running total of gold medals. Values are stored by position, so a
    # country with fewer rows is filled from the left, not matched by date.
    one_country_data = [country] + df1['金牌'].cumsum().values.tolist()
    all_country_data.append(one_country_data)
all_country_data
[['中国', 3, 6, 6, 9, 12, 15, 19, 21, 24, 29, 32, 32, 34, 36, 37],
 ['美国', 4, 7, 9, 11, 14, 14, 16, 20, 22, 24, 25, 29, 31, 31],
 ['日本', 1, 5, 8, 10, 13, 15, 17, 18, 18, 18, 19, 21, 22, 24],
 ['澳大利亚', 1, 2, 3, 6, 8, 9, 10, 14, 14, 15, 17, 17]]

Updating the DataFrame

d2 = pd.DataFrame(data=all_country_data,columns=cols)
d2 = d2.ffill(axis=1)  # forward-fill along each row; fillna(method='ffill') is deprecated in recent pandas
d2



   国家      2021-07-24  2021-07-25  2021-07-26  2021-07-27  2021-07-28  2021-07-29  2021-07-30  2021-07-31
0  中国      3           6           6           9           12          15          19          21
1  美国      4           7           9           11          14          14          16          20
2  日本      1           5           8           10          13          15          17          18
3  澳大利亚  1           2           3           6           8           9           10          14

   国家      2021-08-01  2021-08-02  2021-08-03  2021-08-04  2021-08-05  2021-08-06  2021-08-07
0  中国      24          29          32          32          34.0        36.0        37.0
1  美国      22          24          25          29          31.0        31.0        31.0
2  日本      18          18          19          21          22.0        24.0        24.0
3  澳大利亚  14          15          17          17          17          17          17
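The forward fill is needed because the per-country lists have different lengths. The sketch below, on made-up data, shows the mechanism: when pandas builds a DataFrame from lists of unequal length, it pads the shorter rows with missing values on the right, and a forward fill along each row then repeats the last known cumulative count. The names rows, frame and filled and their values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cumulative counts: country B has one entry fewer than A.
rows = [["A", 1, 2, 3], ["B", 1, 2]]
frame = pd.DataFrame(rows, columns=["name", "d1", "d2", "d3"])

# The short row is padded with a missing value in the last column.
print(frame)

# Forward fill along each row (axis=1) carries the last value to the right.
filled = frame.ffill(axis=1)
print(filled)
```

This only gives a correct picture when the missing days really are at the end of the period, which is worth checking for each country.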

To include other countries, simply change the countrys list.
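The loop in section 2.4 stores each country's running totals by position. As an alternative, the following sketch aligns the totals by date, so a day on which a country has no row counts as zero instead of shifting later values to the left. This is not the method used in the report; the sample data and the names sample, countries, daily and cumulative are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily rows; the second country has no row for 2021-07-25.
sample = pd.DataFrame(
    {
        "国家": ["中国", "美国", "中国", "中国", "美国"],
        "金牌": [3, 4, 3, 0, 2],
    },
    index=pd.Index(
        ["2021-07-24", "2021-07-24", "2021-07-25", "2021-07-26", "2021-07-26"],
        name="日期",
    ),
)
countries = ["中国", "美国"]

# One row per country, one column per date; absent days become 0.
daily = (
    sample[sample["国家"].isin(countries)]
    .reset_index()
    .groupby(["国家", "日期"])["金牌"]
    .sum()
    .unstack(fill_value=0)
)

# Running total along the dates, with rows in the requested order.
cumulative = daily.cumsum(axis=1).reindex(countries)
print(cumulative)
```

With this layout no forward fill is needed, because every country already has a value for every date.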

3. Plotting with Pyecharts

3.1 Drawing a basic line chart

CHN = []
x_data=cols[1:]
for d_time in cols[1:]:
    CHN.append(d2[d_time][d2['国家']=='中国'].values.tolist()[0])
l1 = (
    Line()
    .add_xaxis(x_data)
    # China series
    .add_yaxis(
        '中国',
        CHN,
        label_opts=opts.LabelOpts(is_show=True))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title='中国金牌',
            pos_left='center',
        ),
        yaxis_opts=opts.AxisOpts(
            name='金牌/枚',            
            is_scale=True,
            max_=40),
        legend_opts=opts.LegendOpts(is_show=False),
    ))
l1.render_notebook()
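The loop that fills CHN picks one value per date column. A shorter way to get the same list is to index the wide table by country, as in this sketch. The small frame wide stands in for d2 and is made up for the example; china_loop repeats the approach used above and china_direct is the suggested alternative.

```python
import pandas as pd

# Hypothetical stand-in for the wide table d2 (one row per country).
wide = pd.DataFrame(
    [["中国", 3, 6, 6], ["美国", 4, 7, 9]],
    columns=["国家", "2021-07-24", "2021-07-25", "2021-07-26"],
)
date_cols = wide.columns[1:].tolist()

# Approach used in the report: one boolean lookup per date column.
china_loop = [
    wide[d][wide["国家"] == "中国"].values.tolist()[0] for d in date_cols
]

# Alternative: make the country the index and read the whole row at once.
china_direct = wide.set_index("国家").loc["中国", date_cols].tolist()

print(china_loop, china_direct)
```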

3.2 Applying styles

# Background color
background_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#d9d9d9'}, {offset: 1, color: '#ffd966'}], false)"
)

# Line style
linestyle_dic = { 'normal': {
                    'width': 4,  
                    'shadowColor': '#696969', 
                    'shadowBlur': 10,  
                    'shadowOffsetY': 10,  
                    'shadowOffsetX': 10,  
                    }
                }
    
timeline = Timeline(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js),
                                            width='980px',height='600px'))
timeline.add_schema(is_auto_play=True, is_loop_play=True, 
                    is_timeline_show=True, play_interval=500)

CHN = []
x_data=cols[1:]
for d_time in cols[1:]:
    CHN.append(d2[d_time][d2['国家']=='中国'].values.tolist()[0])
line = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js),
                                 width='980px',height='600px'))
    .add_xaxis(x_data)
    # China series
    .add_yaxis(
        '中国',
        CHN,
        symbol_size=10,
        is_smooth=True,
        label_opts=opts.LabelOpts(is_show=True),
        markpoint_opts=opts.MarkPointOpts(
                data=[  opts.MarkPointItem(
                        name="",
                        type_='max',
                        value_index=0,
                        symbol='image://'+ flg['中国'],
                        symbol_size=[40, 25],
                    )],
                label_opts=opts.LabelOpts(is_show=False),
            )
    )
    .set_series_opts(linestyle_opts=linestyle_dic,label_opts=opts.LabelOpts(font_size=12, color='red' ))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title='中国金牌',
            pos_left='center',
            pos_top='2%',
            title_textstyle_opts=opts.TextStyleOpts(
                    color='#DC143C', font_size=20)
        ),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=14, color='red'),
                                 axisline_opts=opts.AxisLineOpts(is_show=True,
                                    linestyle_opts=opts.LineStyleOpts(width=2, color='#DB7093'))),
        yaxis_opts=opts.AxisOpts(
            name='金牌/枚',            
            is_scale=True,
            max_=40,
            name_textstyle_opts=opts.TextStyleOpts(font_size=16,font_weight='bold',color='#FFD700'),
            axislabel_opts=opts.LabelOpts(font_size=13,color='red'),
            splitline_opts=opts.SplitLineOpts(is_show=True, 
                                              linestyle_opts=opts.LineStyleOpts(type_='dashed')),
            axisline_opts=opts.AxisLineOpts(is_show=True,
                                    linestyle_opts=opts.LineStyleOpts(width=2, color='#DB7093'))
        ),
        legend_opts=opts.LegendOpts(is_show=False, pos_right='1.5%', pos_top='2%',
                                    legend_icon='roundRect',orient = 'horizontal'),
    ))
line.render_notebook()

3.3 Animating China's daily gold medal data

# Background color
background_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#d9d9d9'}, {offset: 1, color: '#ffd966'}], false)"
)

# Line style
linestyle_dic = { 'normal': {
                    'width': 4,  
                    'shadowColor': '#696969', 
                    'shadowBlur': 10,  
                    'shadowOffsetY': 10,  
                    'shadowOffsetX': 10,  
                    }
                }
    
timeline = Timeline(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js),
                                            width='980px',height='600px'))
timeline.add_schema(is_auto_play=True, is_loop_play=True, 
                    is_timeline_show=True, play_interval=500)

CHN = []
x_data=cols[1:]
for d_time in cols[1:]:
    CHN.append(d2[d_time][d2['国家']=='中国'].values.tolist()[0])
    line = (
        Line(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js),
                                     width='980px',height='600px'))
        .add_xaxis(x_data)
        # China series
        .add_yaxis(
            '中国',
            CHN,
            symbol_size=10,
            is_smooth=True,
            label_opts=opts.LabelOpts(is_show=True),
            markpoint_opts=opts.MarkPointOpts(
                    data=[  opts.MarkPointItem(
                            name="",
                            type_='max',
                            value_index=0,
                            symbol='image://'+ flg['中国'],
                            symbol_size=[40, 25],
                        )],
                    label_opts=opts.LabelOpts(is_show=False),
                )
        )
        .set_series_opts(linestyle_opts=linestyle_dic,label_opts=opts.LabelOpts(font_size=12, color='red' ))
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title='中国金牌',
                pos_left='center',
                pos_top='2%',
                title_textstyle_opts=opts.TextStyleOpts(color='#DC143C', font_size=20)),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=14, color='red'),
                         axisline_opts=opts.AxisLineOpts(is_show=True,
                            linestyle_opts=opts.LineStyleOpts(width=2, color='#DB7093'))),
            yaxis_opts=opts.AxisOpts(
                name='金牌/枚',            
                is_scale=True,
                max_=40,
                name_textstyle_opts=opts.TextStyleOpts(font_size=16,font_weight='bold',color='#FFD700'),
                axislabel_opts=opts.LabelOpts(font_size=13,color='red',rotate=15),
                splitline_opts=opts.SplitLineOpts(is_show=True, 
                                                  linestyle_opts=opts.LineStyleOpts(type_='dashed')),
                axisline_opts=opts.AxisLineOpts(is_show=True,
                                        linestyle_opts=opts.LineStyleOpts(width=2, color='#DB7093'))
            ),
            legend_opts=opts.LegendOpts(is_show=True, pos_right='1%', pos_top='2%',
                                        legend_icon='roundRect',orient = 'vertical'),
        ))
    timeline.add(line, '{}'.format(d_time))

timeline.render_notebook()
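The animation works because each pass of the loop appends one more value to CHN before building a chart, so every timeline frame plots a longer prefix of the series against the full list of dates. The sketch below shows only that data pattern, without any charting. The helper growing_prefixes and the sample values are hypothetical.

```python
def growing_prefixes(values):
    """Return [values[:1], values[:2], ...], one list per animation frame."""
    return [values[: i + 1] for i in range(len(values))]


# Hypothetical cumulative gold counts for three days.
dates = ["2021-07-24", "2021-07-25", "2021-07-26"]
gold = [3, 6, 6]

# Map each timeline label (a date) to the data drawn in that frame.
frames = dict(zip(dates, growing_prefixes(gold)))

for label, data in frames.items():
    print(label, data)
```

Because the x-axis always carries all dates, the line appears to grow from left to right as the timeline plays.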

3.4 Adding daily gold medal data for other countries

# Background color
background_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#d9d9d9'}, {offset: 1, color: '#ffd966'}], false)"
)

# Line style
linestyle_dic = { 'normal': {
                    'width': 4,  
                    'shadowColor': '#696969', 
                    'shadowBlur': 10,  
                    'shadowOffsetY': 10,  
                    'shadowOffsetX': 10,  
                    }
                }
    
timeline = Timeline(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js),
                                            width='980px',height='600px'))
timeline.add_schema(is_auto_play=True, is_loop_play=True, 
                    is_timeline_show=True, play_interval=500)

CHN, USA, JPN, AUS = [], [], [], []
x_data=cols[1:]
for d_time in cols[1:]:
    CHN.append(d2[d_time][d2['国家']=='中国'].values.tolist()[0])
    USA.append(d2[d_time][d2['国家']=='美国'].values.tolist()[0])
    JPN.append(d2[d_time][d2['国家']=='日本'].values.tolist()[0])
    AUS.append(d2[d_time][d2['国家']=='澳大利亚'].values.tolist()[0])
    line = (
        Line(init_opts=opts.InitOpts(bg_color=JsCode(background_color_js),
                                     width='980px',height='600px'))
        .add_xaxis(x_data)
        # China series
        .add_yaxis(
            '中国',
            CHN,
            symbol_size=10,
            is_smooth=True,
            label_opts=opts.LabelOpts(is_show=True),
            markpoint_opts=opts.MarkPointOpts(
                    data=[  opts.MarkPointItem(
                            name="",
                            type_='max',
                            value_index=0,
                            symbol='image://'+ flg['中国'],
                            symbol_size=[40, 25],
                        )],
                    label_opts=opts.LabelOpts(is_show=False),
                )
        )
        # United States series
        .add_yaxis(
            '美国',
            USA,
            symbol_size=5,
            is_smooth=True,
            label_opts=opts.LabelOpts(is_show=True),
            markpoint_opts=opts.MarkPointOpts(
                    data=[
                        opts.MarkPointItem(
                            name="",
                            type_='max',
                            value_index=0,
                            symbol='image://'+ flg['美国'],
                            symbol_size=[40, 25],
                        )
                    ],
                    label_opts=opts.LabelOpts(is_show=False),
                )
        )
        # Japan series
        .add_yaxis(
            '日本',
            JPN,
            symbol_size=5,
            is_smooth=True,
            label_opts=opts.LabelOpts(is_show=True),
            markpoint_opts=opts.MarkPointOpts(
                    data=[  opts.MarkPointItem(
                            name="",
                            type_='max',
                            value_index=0,
                            symbol='image://'+ flg['日本'],
                            symbol_size=[40, 25],
                        )],
                    label_opts=opts.LabelOpts(is_show=False),
                )
        )
        # Australia series
        .add_yaxis(
            '澳大利亚',
            AUS,
            symbol_size=5,
            is_smooth=True,
            label_opts=opts.LabelOpts(is_show=True),
            markpoint_opts=opts.MarkPointOpts(
                    data=[  opts.MarkPointItem(
                            name="",
                            type_='max',
                            value_index=0,
                            symbol='image://'+ flg['澳大利亚'],
                            symbol_size=[40, 25],
                        )],
                    label_opts=opts.LabelOpts(is_show=False),
                )
        )
        .set_series_opts(linestyle_opts=linestyle_dic)
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title='中国 VS 美国 VS 日本 VS 澳大利亚',
                pos_left='center',
                pos_top='2%',
                title_textstyle_opts=opts.TextStyleOpts(
                        color='#DC143C', font_size=20)
            ),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=14, color='red'),
                         axisline_opts=opts.AxisLineOpts(is_show=True,
                            linestyle_opts=opts.LineStyleOpts(width=2, color='#DB7093'))),
            yaxis_opts=opts.AxisOpts(
                name='金牌/枚',            
                is_scale=True,
                max_=40,
                name_textstyle_opts=opts.TextStyleOpts(font_size=16,font_weight='bold',color='#FFD700'),
                axislabel_opts=opts.LabelOpts(font_size=13,color='red',rotate=15),
                splitline_opts=opts.SplitLineOpts(is_show=True, 
                                                  linestyle_opts=opts.LineStyleOpts(type_='dashed')),
                axisline_opts=opts.AxisLineOpts(is_show=True,
                                        linestyle_opts=opts.LineStyleOpts(width=2, color='#DB7093'))
            ),
            legend_opts=opts.LegendOpts(is_show=True, pos_right='1%', pos_top='2%',
                                        legend_icon='roundRect',orient = 'vertical'),
        ))
    timeline.add(line, '{}'.format(d_time))
timeline.render_notebook()