经过前面的学习,下面来看⼀些真实世界的数据集。对于每个数据集,我们会⽤之前介绍的⽅法,从原始数据中提取有意义的内容。展示的⽅法适⽤于其它数据集,也包括你的。本篇包含了⼀些各种各样的案例数据集,可以⽤来练习。

案例数据集可以在Github仓库找到。

一、来⾃Bitly的USA.gov数据
2011年,URL缩短服务Bitly跟美国政府⽹站USA.gov合作,提供了⼀份从⽣成.gov或.mil短链接的⽤户那⾥收集来的匿名数据。在2011年,除实时数据之外,还可以下载⽂本⽂件形式的每⼩时快照。这项服务现在已经关闭,但我们保存⼀份数据⽤于本篇的案例。

以每⼩时快照为例,⽂件中各⾏的格式为JSON(即JavaScriptObject Notation,这是⼀种常⽤的Web数据格式)。
例如,如果我们只读取某个⽂件中的第⼀⾏,那么所看到的结果应该是下⾯这样:
path = 'datasets/bitly_usagov/example.txt'
open(path).readline()               # 输出如下:
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
(KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
"tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Python有内置或第三⽅模块可以将JSON字符串转换成Python字典对象。这⾥,我将使⽤json模块及其loads函数逐⾏加载已经下载好的数据⽂件:
import json
path = 'datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]

现在,records对象就成为⼀组Python字典了:
records[0]              # 输出如下:
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
  'c': 'US',
  'nk': 1,
  'tz': 'America/New_York',
  'gr': 'MA',
  'g': 'A6qOVH',
  'h': 'wfLQtf',
  'l': 'orofrog',
  'al': 'en-US,en;q=0.8',
  'hh': '1.usa.gov',
  'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
  'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
  't': 1331923247,
  'hc': 1331822918,
  'cy': 'Danvers',
  'll': [42.576698, -70.954903]}

1、⽤纯Python代码对时区进⾏计数
假设我们想要知道该数据集中最常出现的是哪个时区(即tz字段),得到答案的办法有很多。⾸先,我们⽤列表推导式取出⼀组时区:
time_zones = [rec['tz'] for rec in records]     # 执行报错,输出如下:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-f3fbbc37f129> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]
<ipython-input-22-f3fbbc37f129> in <listcomp>(.0)
----> 1 time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'
因为并不是所有记录都有时区字段。此时,只需在列表推导式末尾加上⼀个if 'tz'in rec判断即可:
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
time_zones[:10]                     # 输出如下:
['America/New_York',
  'America/Denver',
  'America/New_York',
  'America/Sao_Paulo',
  'America/New_York',
  'America/New_York',
  'Europe/Warsaw',
  '',
  '',
  '']

只看前10个时区,我们发现有些是未知的(即空的)。虽然可以将它们过滤掉,但现在暂时先留着。接下来,为了对时区进⾏计数,这⾥介绍两个办法:⼀个较难(只使⽤标准Python库),另⼀个较简单(使⽤pandas)。计数的办法之⼀是在遍历时区的过程中将计数值保存在字典中:
def get_counts(sequence):
       counts = {}
       for x in sequence:
             if x in counts:
                  counts[x] += 1
           else:
                  counts[x] = 1
       return counts
如果使⽤Python标准库的更⾼级⼯具,那么你可能会将代码写得更简洁⼀些:
from collections import defaultdict
def get_counts2(sequence):
       counts = defaultdict(int)
       for x in sequence:
             counts[x] += 1
       return counts
      
我将逻辑写到函数中是为了获得更⾼的复⽤性。要⽤它对时区进⾏处理,只需将time_zones传⼊即可:
counts = get_counts(time_zones)
counts['America/New_York']          # 输出:1251
len(time_zones)                     # 输出:3440

如果想要得到前10位的时区及其计数值,我们需要⽤到⼀些有关字典的处理技巧:
def top_counts(count_dict, n=10):
       value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
       value_key_pairs.sort()
       return value_key_pairs[-n:]
然后有:
top_counts(counts)      # 输出如下:
[(33, 'America/Sao_Paulo'),
(35, 'Europe/Madrid'),
(36, 'Pacific/Honolulu'),
(37, 'Asia/Tokyo'),
(74, 'Europe/London'),
(191, 'America/Denver'),
(382, 'America/Los_Angeles'),
(400, 'America/Chicago'),
(521, ''),
(1251, 'America/New_York')]
 
如果你搜索Python的标准库,你能找到collections.Counter类,它可以使这项⼯作更简单:
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)              # 输出如下:
[('America/New_York', 1251),
  ('', 521),
  ('America/Chicago', 400),
  ('America/Los_Angeles', 382),
  ('America/Denver', 191),
  ('Europe/London', 74),
  ('Asia/Tokyo', 37),
  ('Pacific/Honolulu', 36),
  ('Europe/Madrid', 35),
  ('America/Sao_Paulo', 33)]

2、⽤pandas对时区进⾏计数
从原始记录的集合创建DateFrame,与将记录列表传递到pandas.DataFrame⼀样简单:
import pandas as pd
frame = pd.DataFrame(records)
frame.info()            # 输出如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
_heartbeat_       120 non-null float64
a                        3440 non-null object
al                       3094 non-null object
c                        2919 non-null object
cy                      2919 non-null object
g                       3440 non-null object
gr                      2919 non-null object
h                       3440 non-null object
hc                     3440 non-null float64
hh                     3440 non-null object
kw                     93 non-null object
l                        3440 non-null object
ll                       2919 non-null object
nk                     3440 non-null float64
r                       3440 non-null object
t                       3440 non-null float64
tz                     3440 non-null object
u                      3440 non-null object
dtypes: float64(4), object(14)
memory usage: 500.7+ KB
frame['tz'][:10]        # 输出如下:获取tz列的前10个元素
0     America/New_York
1         America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6           Europe/Warsaw
7
8
9
Name: tz, dtype: object

这⾥frame的输出形式是摘要视图(summary view),主要⽤于较⼤的DataFrame对象。我们然后可以对Series使⽤value_counts⽅法:
tz_counts = frame['tz'].value_counts()
tz_counts[:10]          # 输出如下:
America/New_York                1251
                                                521
America/Chicago                    400
America/Los_Angeles             382
America/Denver                     191
Europe/London                       74
Asia/Tokyo                              37
Pacific/Honolulu                     36
Europe/Madrid                       35
America/Sao_Paulo                33
Name: tz, dtype: int64

我们可以⽤matplotlib可视化这个数据。为此,我们先给记录中未知或缺失的时区填上⼀个替代值。fillna函数可以替换缺失值(NA),⽽未知值(空字符串)则可以通过布尔型数组索引加以替换:
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
tz_counts[:10]          # 输出如下:
America/New_York                1251
Unknown                                 521
America/Chicago                    400
America/Los_Angeles             382
America/Denver                     191
Missing                                   120
Europe/London                        74
Asia/Tokyo                               37
Pacific/Honolulu                      36
Europe/Madrid                         35
Name: tz, dtype: int64

此时,我们可以⽤seaborn包创建⽔平柱状图(结果⻅图14-1):
import seaborn as sns
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)    # 输出图形14-1

数据分析项目 数据分析项目经验案例_Windows                                图14-1  usa.gov示例数据中最常出现的时区

a字段含有执⾏URL短缩操作的浏览器、设备、应⽤程序的相关信息:
frame['a'][1]           # 输出:'GoogleMaps/RochesterNY'
frame['a'][50]          # 输出如下:
'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51][:50]     # a列第51行前50个字符,输出如下:
'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9'

将这些"agent"字符串中的所有信息都解析出来是⼀件挺郁闷的⼯作。⼀种策略是将这种字符串的第⼀节(与浏览器⼤致对应)分离出来并得到另外⼀份⽤户⾏为摘要:
results = pd.Series([x.split()[0] for x in frame.a.dropna()])           # 提取数据
results[:5]             # 查看结果,输出如下:
0                             Mozilla/5.0
1    GoogleMaps/RochesterNY
2                             Mozilla/4.0
3                             Mozilla/5.0
4                             Mozilla/5.0
dtype: object
results.value_counts()[:8]          # 统计数量,输出如下:
Mozilla/5.0                              2594
Mozilla/4.0                                601
GoogleMaps/RochesterNY       121
Opera/9.80                                 34
TEST_INTERNET_AGENT             24
GoogleProducer                         21
Mozilla/6.0                                   5
BlackBerry8520/5.0.0.681            4
dtype: int64

现在,假设你想按Windows和⾮Windows⽤户对时区统计信息进⾏分解。为了简单起⻅,我们假定只要agent字符串中含有"Windows"就认为该⽤户为Windows⽤户。由于有的agent缺失,所以⾸先将它们从数据中移除:
cframe = frame[frame.a.notnull()]               # 过滤a列不为空的行
然后计算出各⾏是否含有Windows的值:str.contains()方法是pandas的字符串方法,不是Python的
cframe['os'] = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
cframe['os'][:5]        # 输出如下:
0           Windows
1    Not Windows
2           Windows
3    Not Windows
4           Windows
Name: os, dtype: object
接下来就可以根据时区和新得到的操作系统列表对数据进⾏分组了:
by_tz_os = cframe.groupby(['tz', 'os'])
分组计数,类似于value_counts函数,可以⽤size来计算。并利⽤unstack对计数结果进⾏重塑:
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]                     # 输出如下:
os                                                             Not Windows  Windows
tz
                                                                             245.0         276.0
Africa/Cairo                                                             0.0             3.0
Africa/Casablanca                                                   0.0             1.0
Africa/Ceuta                                                            0.0             2.0
Africa/Johannesburg                                               0.0             1.0
Africa/Lusaka                                                           0.0             1.0
America/Anchorage                                                4.0             1.0
America/Argentina/Buenos_Aires                           1.0             0.0
America/Argentina/Cordoba                                   0.0             1.0
America/Argentina/Mendoza                                  0.0             1.0

最后,我们来选取最常出现的时区。为了达到这个⽬的,我根据agg_counts中的⾏数构造了⼀个间接索引数组:
# Use to sort in ascending order
indexer = agg_counts.sum(1).argsort()    # 按行求和
indexer[:10]            # 输出如下:
tz
                                                           24
Africa/Cairo                                        20
Africa/Casablanca                              21
Africa/Ceuta                                       92
Africa/Johannesburg                          87
Africa/Lusaka                                      53
America/Anchorage                           54
America/Argentina/Buenos_Aires      57
America/Argentina/Cordoba              26
America/Argentina/Mendoza             55
dtype: int64

然后我通过take按照这个顺序截取了最后10⾏最⼤值:
count_subset = agg_counts.take(indexer[-10:])
count_subset            # 输出如下:
os                                           Not Windows  Windows
tz
America/Sao_Paulo                             13.0          20.0
Europe/Madrid                                    16.0          19.0
Pacific/Honolulu                                    0.0          36.0
Asia/Tokyo                                             2.0          35.0
Europe/London                                    43.0          31.0
America/Denver                                 132.0          59.0
America/Los_Angeles                        130.0        252.0
America/Chicago                               115.0        285.0
                                                            245.0        276.0
America/New_York                            339.0        912.0

pandas有⼀个简便⽅法nlargest,可以做同样的⼯作:
agg_counts.sum(1).nlargest(10)      # 输出如下:按行求和,并取值最大的10个
tz
America/New_York             1251.0
                                              521.0
America/Chicago                  400.0
America/Los_Angeles           382.0
America/Denver                   191.0
Europe/London                      74.0
Asia/Tokyo                             37.0
Pacific/Honolulu                    36.0
Europe/Madrid                       35.0
America/Sao_Paulo                33.0
dtype: float64
然后,如这段代码所示,可以⽤柱状图表示。我传递⼀个额外参数到seaborn的barpolt函数,来画⼀个堆积条形图(⻅图14-2):
# Rearrange the data for plotting
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
count_subset[:10]                   # 输出如下:
                                tz                      os          total
0   America/Sao_Paulo   Not Windows          13.0
1   America/Sao_Paulo          Windows          20.0
2         Europe/Madrid    Not Windows         16.0
3         Europe/Madrid           Windows         19.0
4       Pacific/Honolulu    Not Windows           0.0
5       Pacific/Honolulu           Windows         36.0
6                Asia/Tokyo    Not Windows           2.0
7                Asia/Tokyo           Windows          35.0
8         Europe/London    Not Windows          43.0
9         Europe/London           Windows          31.0
sns.barplot(x='total', y='tz', hue='os', data=count_subset)             # 输出图形14-2

数据分析项目 数据分析项目经验案例_Python_02                                       图14-2  最常出现时区的Windows和⾮Windows⽤户

这张图不容易看出Windows⽤户在⼩分组中的相对⽐例,因此标准化分组百分⽐之和为1:
def norm_total(group):
       group['normed_total'] = group.total / group.total.sum()
       return group
results = count_subset.groupby('tz').apply(norm_total)
再次画图,⻅图14-3:
sns.barplot(x='normed_total', y='tz', hue='os', data=results)           # 输出图14-3,根据os列画堆积图形

数据分析项目 数据分析项目经验案例_数据_03                                 图14-3  最常出现时区的Windows和⾮Windows⽤户的百分⽐

我们还可以⽤groupby的transform⽅法,更⾼效的计算标准化的和:
g = count_subset.groupby('tz')
results2 = count_subset.total / g.total.transform('sum')

二、MovieLens 1M数据集
GroupLens Research(http://www.grouplens.org/node/73)采集了⼀组从20世纪90年末到21世纪初由MovieLens⽤户提供的电影评分数据。这些数据中包括电影评分、电影元数据(⻛格类型和年代)以及关于⽤户的⼈⼝统计学数据(年龄、邮编、性别和职业等)。基于机器学习算法的推荐系统⼀般都会对此类数据感兴趣。虽然这里不会详细介绍机器学习技术,但我会告诉你如何对这种数据进⾏切⽚切块以满⾜实际需求。

MovieLens 1M数据集含有来⾃6000名⽤户对4000部电影的100万条评分数据。它分为三个表:评分、⽤户信息和电影信息。将该数据从zip⽂件中解压出来之后,可以通过pandas.read_table将各个表分别读到⼀个pandas DataFrame对象中:
import pandas as pd

# Make display smaller
pd.options.display.max_rows = 10

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('datasets/movielens/users.dat', sep='::',
                                      header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('datasets/movielens/ratings.dat', sep='::',
                                         header=None, names=rnames)
                                        
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                                          header=None, names=mnames)

利⽤Python的切⽚语法,通过查看每个DataFrame的前⼏⾏即可验证数据加载⼯作是否⼀切顺利:
users[:5]               # same as users.head(),输出如下:
    user_id    gender  age  occupation           zip
0          1              F      1                 10      48067
1          2             M   56                  16     70072
2          3             M   25                  15     55117
3          4             M   45                    7     02460
4          5             M   25                  20     55455
ratings[:5]             # 输出如下:
      user_id    movie_id    rating       timestamp
0             1           1193           5      978300760
1             1             661           3      978302109
2             1             914           3      978301968
3             1           3408           4      978300275
4             1           2355           5      978824291
movies.head()           # 输出如下:
       movie_id                                                       title                                           genres
0                 1                                    Toy Story (1995)       Animation|Children's|Comedy
1                 2                                       Jumanji (1995)         Adventure|Children's|Fantasy
2                 3                     Grumpier Old Men (1995)                         Comedy|Romance
3                 4                        Waiting to Exhale (1995)                             Comedy|Drama
4                 5            Father of the Bride Part II (1995)                                         Comedy
ratings                 # 输出如下:
                user_id    movie_id    rating         timestamp
0                       1           1193           5        978300760
1                       1             661           3        978302109
2                       1             914           3        978301968
3                       1           3408           4        978300275
4                       1           2355           5        978824291
...                     ...                 ...           ...        ...
1000204     6040           1091           1        956716541
1000205     6040           1094           5        956704887
1000206     6040             562           5        956704746
1000207     6040           1096           4        956715648
1000208     6040           1097           4        956715569
[1000209 rows x 4 columns]

注意,其中的年龄和职业是以编码形式给出的,它们的具体含义请参考该数据集的README⽂件。分析散布在三个表中的数据可不是⼀件轻松的事情。假设我们想要根据性别和年龄计算某部电影的平均得分,如果将所有数据都合并到⼀个表中的话问题就简单多了。我们先⽤pandas的merge函数将ratings跟users合并到⼀起,然后再将movies也合并进去。pandas会根据列名的重叠情况推断出哪些列是合并(或连接)键:
data = pd.merge(pd.merge(ratings, users), movies)
data        # 输出如下:
                     user_id    movie_id    rating     timestamp   gender  age   occupation          zip       \
0                            1          1193            5    978300760             F      1                 10      48067
1                            2          1193            5    978298413            M   56                  16     70072
2                          12          1193            4    978220179            M   25                  12     32793
3                          15          1193            4    978199279            M   25                    7     22903
4                          17          1193            5    978158471            M   50                    1     95350
...                          ...               ...             ...            ...                     ...    ...                   ...           ...
1000204          5949          2198            5    958846401            M   18                  17      47901
1000205          5675          2703            3    976029116            M   35                  14      30030
1000206          5780          2845            1    958153068            M   18                  17      92886
1000207          5851          3607            5    957756608             F   18                   20      55410
1000208          5938          2909            4    957273353            M   25                    1      35401
                                                                                     title                                  genres
0                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
1                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
2                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
3                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
4                         One Flew Over the Cuckoo's Nest (1975)                                 Drama
...                                                                                 ...                                  ...
1000204                                               Modulations (1998)                      Documentary
1000205                                           Broken Vessels (1998)                                 Drama
1000206                                                 White Boys (1999)                                 Drama
1000207                                         One Little Indian (1973)    Comedy|Drama|Western
1000208       Five Wives, Three Secretaries and Me (1998)                      Documentary
[1000209 rows x 10 columns]
data.iloc[0]            # 输出如下:
user_id                                                                             1
movie_id                                                                    1193
rating                                                                               5
timestamp                                                       978300760
gender                                                                             F
age                                                                                  1
occupation                                                                    10
zip                                                                            48067
title                  One Flew Over the Cuckoo's Nest (1975)
genres                                                                    Drama
Name: 0, dtype: object

为了按性别计算每部电影的平均得分,我们可以使⽤pivot_table⽅法:
mean_ratings = data.pivot_table('rating', index='title',
                                                       columns='gender', aggfunc='mean')
mean_ratings[:5]        # 输出如下:
gender                                                   F             M
title
$1,000,000 Duck (1971)            3.375000  2.761905
'Night Mother (1986)                3.388889  3.352941
'Til There Was You (1997)         2.675676  2.733333
'burbs, The (1989)                     2.793478  2.962085
...And Justice for All (1979)       3.828571  3.689024

该操作产⽣了另⼀个DataFrame,其内容为电影平均得分,⾏标为电影名称(索引),列标为性别。现在,我打算过滤掉评分数据不够250条的电影(随便选的⼀个数字)。为了达到这个⽬的,我先对title进⾏分组,然后利⽤size()得到⼀个含有各电影分组⼤⼩的Series对象:
ratings_by_title = data.groupby('title').size()             # 先分组再统计
ratings_by_title[:10]               # 输出如下:
title
$1,000,000 Duck (1971)                           37
'Night Mother (1986)                               70
'Til There Was You (1997)                        52
'burbs, The (1989)                                  303
...And Justice for All (1979)                    199
1-900 (1994)                                              2
10 Things I Hate About You (1999)       700
101 Dalmatians (1961)                           565
101 Dalmatians (1996)                           364
12 Angry Men (1957)                             616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]         # 获取评分大于250条的索引
active_titles           # 输出如下:
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
        '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
        '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
        '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
        '2010 (1984)',
        ...
        'X-Men (2000)', 'Year of Living Dangerously (1982)',
        'Yellow Submarine (1968)', 'You've Got Mail (1998)',
        'Young Frankenstein (1974)', 'Young Guns (1988)',
        'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
        'Zero Effect (1998)', 'eXistenZ (1999)'],
       dtype='object', name='title', length=1216)

标题索引中含有评分数据⼤于250条的电影名称,然后我们就可以据此从前⾯的mean_ratings中选取所需的⾏了:
# Select rows on the index,根据索引选取行,loc方法是标签索引,iloc是整数索引
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings            # 输出如下:
gender                                                                F             M
title
'burbs, The (1989)                                  2.793478  2.962085
10 Things I Hate About You (1999)       3.646552  3.311966
101 Dalmatians (1961)                           3.791444  3.500000
101 Dalmatians (1996)                           3.240000  2.911215
12 Angry Men (1957)                             4.184397  4.328421
...                                                                     ...            ...
Young Guns (1988)                                3.371795  3.425620
Young Guns II (1990)                             2.934783  2.904025
Young Sherlock Holmes (1985)             3.514706  3.363344
Zero Effect (1998)                                  3.864407  3.723140
eXistenZ (1999)                                      3.098592  3.289086
[1216 rows x 2 columns]

为了了解⼥性观众最喜欢的电影,我们可以对F列降序排列:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]             # 输出如下:
gender                                                                                            F             M
title
Close Shave, A (1995)                                                         4.644444  4.473795
Wrong Trousers, The (1993)                                               4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)                    4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...      4.563107  4.385075
Schindler's List (1993)                                                        4.562602  4.491415
Shawshank Redemption, The (1994)                                  4.539075  4.560625
Grand Day Out, A (1992)                                                    4.537879  4.293255
To Kill a Mockingbird (1962)                                             4.536667  4.372611
Creature Comforts (1990)                                                   4.513889  4.272277
Usual Suspects, The (1995)                                                4.513317  4.518248

1、计算评分分歧
假设我们想要找出男性和⼥性观众分歧最⼤的电影。⼀个办法是给mean_ratings加上⼀个⽤于存放平均得分之差的列,并对其进⾏排序:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']            # 新增一列diff
按"diff"排序即可得到分歧最⼤且⼥性观众更喜欢的电影:
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]                 # 输出如下:
                                                                    title                F             M           diff
307                                     Dirty Dancing (1987)  3.790378  2.959596 -0.830782
593                              Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
483                                                Grease (1978)  3.975265  3.367041 -0.608224
647                                      Little Women (1994)  3.870588  3.321739 -0.548849
1049                                Steel Magnolias (1989)  3.901734  3.365957 -0.535777
65                                               Anastasia (1997)  3.800000  3.281609 -0.518391
911         Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
238                                 Color Purple, The (1985)  4.158192  3.659341 -0.498851
28                           Age of Innocence, The (1993)  3.827068  3.339506 -0.487561
425                                             Free Willy (1993)  2.921348  2.438776 -0.482573
对排序结果反序并取出前10⾏,得到的则是男性观众更喜欢的电影:
# Reverse order of rows, take first 10 rows
sorted_by_diff[::-1][:10]           # 输出如下:
                                                                     title                F             M           diff
475      Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
602                  Kentucky Fried Movie, The (1977)  2.878788  3.555147  0.676359
331                                 Dumb & Dumber (1994)  2.697987  3.336595  0.638608
655                                 Longest Day, The (1962)  3.411765  4.031447  0.619682
193                                     Cable Guy, The (1996)  2.250000  2.863787  0.613787
362                Evil Dead II (Dead By Dawn) (1987)  3.297297  3.909283  0.611985
523                                          Hidden, The (1987)  3.137931  3.745098  0.607167
913                                                Rocky III (1982)  2.361702  2.943503  0.581801
194                                           Caddyshack (1980)  3.396135  3.969737  0.573602
417                         For a Few Dollars More (1965)  3.409091  3.953795  0.544704

如果只是想要找出分歧最⼤的电影(不考虑性别因素),则可以计算得分数据的⽅差或标准差:
# Standard deviation of rating grouped by title,计算标准差
rating_std_by_title = data.groupby('title')['rating'].std()
# Filter down to active_titles,滤除掉评分小于250条的电影
rating_std_by_title = rating_std_by_title.loc[active_titles]
# Order Series by value in descending order     # 按值降序排序
rating_std_by_title.sort_values(ascending=False)[:10]       # 输出如下:
title
Dumb & Dumber (1994)                               1.321333
Blair Witch Project, The (1999)                      1.316368
Natural Born Killers (1994)                            1.307198
Tank Girl (1995)                                             1.277695
Rocky Horror Picture Show, The (1975)        1.260177
Eyes Wide Shut (1999)                                  1.259624
Evita (1996)                                                    1.253631
Billy Madison (1995)                                      1.249970
Fear and Loathing in Las Vegas (1998)          1.246408
Bicentennial Man (1999)                                1.245533
Name: rating, dtype: float64

可能你已经注意到了,电影分类是以竖线(|)分隔的字符串形式给出的。如果想对电影分类进⾏分析的话,就需要先将其转换成更有⽤的形式才⾏。

三、1880-2010年间全美婴⼉姓名
美国社会保障总署(SSA)提供了⼀份从1880年到现在的婴⼉名字频率数据。Hadley Wickham(许多流⾏R包的作者)经常⽤这份数据来演示R的数据处理功能。

我们要做⼀些数据规整才能加载这个数据集,这么做就会产⽣⼀个如下的DataFrame:
names = pd.read_csv('datasets/babynames/yob1880.txt', names=['name', 'sex', 'births'])
names['year'] = 1880
names[:10]              # 输出如下:
              name sex  births  year
0            Mary   F    7065  1880
1            Anna   F    2604  1880
2          Emma   F    2003  1880
3      Elizabeth   F    1939  1880
4         Minnie   F    1746  1880
5      Margaret   F    1578  1880
6                Ida   F    1472  1880
7             Alice   F    1414  1880
8           Bertha   F    1320  1880
9             Sarah   F    1288  1880
你可以⽤这个数据集做很多事,例如:
             计算指定名字(可以是你⾃⼰的,也可以是别⼈的)的年度⽐例。
             计算某个名字的相对排名。
             计算各年度最流⾏的名字,以及增⻓或减少最快的名字。
             分析名字趋势:元⾳、辅⾳、⻓度、总体多样性、拼写变
             化、⾸尾字⺟等。
             分析外源性趋势:圣经中的名字、名⼈、⼈⼝结构变化等。
利⽤前⾯介绍过的那些⼯具,这些分析⼯作都能很轻松地完成,
这里会讲解其中的⼀些。

目前而言,美国社会保障总署将该数据库按年度制成了多个数据⽂件,其中给出了每个性别/名字组合的出⽣总数。这些⽂件的原始档案可以在这⾥获取:http://www.ssa.gov/oact/babynames/limits.html。

下载"National data"⽂件names.zip,解压后的⽬录中含有⼀组⽂件(如yob1880.txt)。⽤UNIX的head命令查看了其中⼀个⽂件的前10⾏(在Windows上,你可以⽤more命令,或直接在⽂本编辑器中打开):
# windows cmd: !more datasets\\babynames\\yob1880.txt -n 10
!head -n 10 datasets/babynames/yob1880.txt      # 输出如下:
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288

由于这是⼀个⾮常标准的以逗号隔开的格式,所以可以⽤pandas.read_csv将其加载到DataFrame中:
names1880 = pd.read_csv('datasets/babynames/yob1880.txt',
                                              names=['name', 'sex', 'births'])
names1880               # 输出如下:
                      name sex  births
0                    Mary   F    7065
1                    Anna   F    2604
2                  Emma   F    2003
3             Elizabeth   F    1939
4                Minnie   F    1746
...                    ...        ..        ...
1995        Woodie   M       5
1996         Worthy   M       5
1997          Wright   M       5
1998              York   M       5
1999     Zachariah   M       5
[2000 rows x 3 columns]

这些⽂件中仅含有当年出现超过5次的名字。为了简单起⻅,我们可以⽤births列的sex分组⼩计表示该年度的births总计:
names1880.groupby('sex').births.sum()           # 输出如下:
sex
F        90993
M    110493
Name: births, dtype: int64

由于该数据集按年度被分隔成了多个⽂件,所以第⼀件事情就是要将所有数据都组装到⼀个DataFrame⾥⾯,并加上⼀个year字段。使⽤pandas.concat即可达到这个⽬的:
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
# 使用循环将每个文件内容读取列表中
for year in years:
       path = 'datasets/babynames/yob%d.txt' % year
       frame = pd.read_csv(path, names=columns)
       frame['year'] = year
       pieces.append(frame)
      
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)     # ignore_index=True,不保留原始行号
这⾥需要注意⼏件事情。第⼀,concat默认是按⾏将多个DataFrame组合到⼀起的;第⼆,必须指定ignore_index=True,因为我们不希望保留read_csv所返回的原始⾏号。现在我们得到了⼀个⾮常⼤的DataFrame,它含有全部的名字数据:
names                   # 输出如下:
                           name sex  births  year
0                         Mary   F    7065  1880
1                         Anna   F    2604  1880
2                       Emma   F    2003  1880
3                   Elizabeth   F    1939  1880
4                      Minnie   F    1746  1880
...                              ...    ..        ...      ...
1690779        Zymaire   M       5  2010
1690780         Zyonne   M       5  2010
1690781     Zyquarius   M       5  2010
1690782            Zyran   M       5  2010
1690783            Zzyzx   M       5  2010
[1690784 rows x 4 columns]

有了这些数据之后,我们就可以利⽤groupby或pivot_table在year和sex级别上对其进⾏聚合了,如图14-4所示:
# 行分组是year列,列分组是sex列,被计数列是births列,这里应用于计数的方式是求和
total_births = names.pivot_table('births', index='year',
                                                       columns='sex', aggfunc=sum)
total_births.tail()      # 输出如下:(最后5行)
sex                F            M
year
2006  1896468  2050234
2007  1916888  2069242
2008  1883645  2032310
2009  1827643  1973359
2010  1759010  1898382
total_births.plot(title='Total births by sex and year')     # 输出图形14-4

数据分析项目 数据分析项目经验案例_数据_04                                    图14-4  按性别和年度统计的总出⽣数

下⾯我们来插⼊⼀个prop列,⽤于存放指定名字的婴⼉数相对于总出⽣数的⽐例。prop值为0.02表示每100名婴⼉中有2名取了当前这个名字。因此,我们先按year和sex分组,然后再将新列加到各个分组上:
def add_prop(group):
       group['prop'] = group.births / group.births.sum()
       return group
names = names.groupby(['year', 'sex']).apply(add_prop)
现在,完整的数据集就有了下⾯这些列:
names                   # 输出如下:
                            name sex  births  year      prop
0                          Mary   F    7065  1880  0.077643
1                          Anna   F    2604  1880  0.028618
2                        Emma   F    2003  1880  0.022013
3                    Elizabeth   F    1939  1880  0.021309
4                       Minnie   F    1746  1880  0.019188
...                             ...      ..        ..      ...           ...
1690779         Zymaire   M        5   2010  0.000003
1690780          Zyonne   M        5   2010  0.000003
1690781      Zyquarius   M        5   2010  0.000003
1690782             Zyran   M        5   2010  0.000003
1690783             Zzyzx   M        5   2010  0.000003
[1690784 rows x 5 columns]

在执⾏这样的分组处理时,⼀般都应该做⼀些有效性检查,⽐如验证所有分组的prop的总和是否为1:
names.groupby(['year', 'sex']).prop.sum()        # 输出如下:
year     sex
1880        F      1.0
              M      1.0
1881        F      1.0
              M      1.0
1882        F      1.0
              ...
2008     M        1.0
2009       F       1.0
              M      1.0
2010       F      1.0
              M      1.0
Name: prop, Length: 262, dtype: float64

⼯作完成。为了便于实现更进⼀步的分析,我需要取出该数据的⼀个⼦集:每对sex/year组合的前1000个名字。
这⼜是⼀个分组操作:
def get_top1000(group, column):     # 这里参数调用可能有问题,将column换成'births'实验
       return group.sort_values(by=column, ascending=False)[:1000]
grouped = names.groupby(['year', 'sex'])        # 先按年和性别分组
top1000 = grouped.apply(get_top1000, 'births')    # 对分组应用get_top1000函数
top1000.reset_index(inplace=True, drop=True)

如果你喜欢DIY的话,也可以这样:
pieces = []
for year, group in names.groupby(['year', 'sex']):
       pieces.append(group.sort_values(by='births', ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)

现在的结果数据集就⼩多了:
top1000                 # 输出如下:
                        name sex  births  year         prop
0                      Mary    F    7065  1880  0.038822
1                      Mary    F    7065  1880  0.038822
2                      Anna    F    2604  1880  0.014309
3                      Anna    F    2604  1880  0.014309
4                    Emma    F    2003  1880  0.011006
...                         ...      ..        ...        ...          ...
261995       Jamison   M     515  2010  0.000136
261996      Deandre   M     513  2010  0.000135
261997      Deandre   M     513  2010  0.000135
261998         Dorian   M     512  2010  0.000135
261999         Dorian   M     512  2010  0.000135
[262000 rows x 5 columns]
接下来的数据分析⼯作就针对这个top1000数据集了。

1、分析命名趋势
有了完整的数据集和刚才⽣成的top1000数据集,我们就可以开始分析各种命名趋势了。⾸先将前1000个名字分为男⼥两个部分:
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

这是两个简单的时间序列,只需稍作整理即可绘制出相应的图表(⽐如每年叫做John和Mary的婴⼉数)。我们先⽣成⼀张按year和name统计的总出⽣数透视表:
total_births = top1000.pivot_table('births', index='year',
                                                           columns='name',
                                                            aggfunc=sum)

现在,我们⽤DataFrame的plot⽅法绘制⼏个名字的曲线图(⻅图14-5):
total_births.info()     # 输出如下:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 3118 entries, Aaden to Zula
dtypes: float64(3118)
memory usage: 3.1 MB

subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]
subset.plot(subplots=True, figsize=(12, 10), grid=False,
                    title="Number of births per year")       # 输出图形14-5

数据分析项目 数据分析项目经验案例_数据分析项目_05                                图14-5  ⼏个男孩和⼥孩名字随时间变化的使⽤数量
从图中可以看出,这⼏个名字在美国⼈⺠的⼼⽬中已经⻛光不再了。但事实并⾮如此简单,我们在下⼀节中就能知道是怎么⼀回事了。

2、评估命名多样性的增⻓
⼀种解释是⽗⺟愿意给⼩孩起常⻅的名字越来越少。这个假设可以从数据中得到验证。⼀个办法是计算最流⾏的1000个名字所占的⽐例,我按year和sex进⾏聚合并绘图(⻅图14-6):
table = top1000.pivot_table('prop', index='year',
                                                columns='sex', aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',
                  yticks=np.linspace(0, 1.2, 13),
                  xticks=range(1880, 2020, 10))              # 输出图形14-6

数据分析项目 数据分析项目经验案例_数据_06                                 图14-6  分性别统计的前1000个名字在总出⽣⼈数中的⽐例

从图中可以看出,名字的多样性确实出现了增⻓(前1000项的⽐例降低)。另⼀个办法是计算占总出⽣⼈数前50%的不同名字的数量,这个数字不太好计算。我们只考虑2010年男孩的名字:
df = boys[boys.year == 2010]
df          # 输出如下:
                       name sex   births   year        prop
261000          Jacob   M   21875  2010  0.005761
261001          Jacob   M   21875  2010  0.005761
261002           Ethan   M   17866  2010  0.004706
261003           Ethan   M   17866  2010  0.004706
261004       Michael   M   17133  2010  0.004513
...                     ...          ..        ...       ...          ...
261995       Jamison   M       515  2010  0.000136
261996      Deandre   M       513  2010  0.000135
261997      Deandre   M       513  2010  0.000135
261998         Dorian   M       512  2010  0.000135
261999         Dorian   M       512  2010  0.000135
[1000 rows x 5 columns]

在对prop降序排列之后,我们想知道前⾯多少个名字的⼈数加起来才够50%。虽然编写⼀个for循环确实也能达到⽬的,但NumPy有⼀种更聪明的⽮量⽅式。先计算prop的累计和cumsum,然后再通过searchsorted⽅法找出0.5应该被插⼊在哪个位置才能保证不破坏顺序:
prop_cumsum = df.sort_values(by='prop', ascending=False).prop.cumsum()
prop_cumsum[:10]        # 输出如下:
261000    0.005761
261001    0.011523
261002    0.016229
261003    0.020934
261004    0.025447
261005    0.029959
261006    0.034445
261007    0.038930
261008    0.043373
261009    0.047817
Name: prop, dtype: float64

prop_cumsum.values.searchsorted(0.5)            # 输出:233
由于数组索引是从0开始的,因此我们要给这个结果加1,即最终结果为234。拿1900年的数据来做个⽐较,
这个数字要⼩得多:
in1900 = df.sort_values(by='prop', ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5) + 1             # 输出:50

现在就可以对所有year/sex组合执⾏这个计算了。按这两个字段进⾏groupby处理,然后⽤⼀个函数计算各分组的这个值:
def get_quantile_count(group, q=0.5):
       group = group.sort_values(by='prop', ascending=False)
       return group.prop.cumsum().values.searchsorted(q) + 1
      
diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')

现在,diversity这个DataFrame拥有两个时间序列(每个性别各⼀个,按年度索引)。通过IPython,你可以查看其内容,还可以像之前那样绘制图表(如图14-7所示):
diversity.head()         # 输出如下:
sex      F   M
year
1880  75  27
1881  75  27
1882  76  29
1883  77  29
1884  78  31

diversity.plot(title='Number of popular names in top 50%', grid=True)   # 输出图形14-7

数据分析项目 数据分析项目经验案例_数据分析项目_07                                          图14-7  按年度统计的密度表

从图中可以看出,⼥孩名字的多样性总是⽐男孩的⾼,⽽且还在变得越来越⾼。可自行分析⼀下具体是什么在驱动这个多样性(⽐如拼写形式的变化)。

3、“最后⼀个字⺟”的变⾰
2007年,⼀名婴⼉姓名研究⼈员Laura Wattenberg在她⾃⼰的⽹站上指出(http://www.babynamewizard.com):近百年来,男孩名字在最后⼀个字⺟上的分布发⽣了显著的变化。为了了解具体的情况,我⾸先将全部出⽣数据在年度、性别以及末字⺟上进⾏了聚合:
# extract last letter from name column
get_last_letter = lambda x: x[-1]            # 取最后一个字母
last_letters = names.name.map(get_last_letter)     # 对于Series使用map()方法,DataFrame使用apply()方法
last_letters.name = 'last_letter'
table = names.pivot_table('births', index=last_letters,
                                              columns=['sex', 'year'], aggfunc=sum)

然后,我选出具有⼀定代表性的三年,并输出前⾯⼏⾏:
subtable = table.reindex(columns=[1910, 1960, 2010], level='year')
subtable.head()                      # 输出如下:
sex                            F                                              M
year                     1910          1960           2010      1910          1960        2010
last_letter
a                   108376.0    691247.0    670605.0      977.0       5204.0   28438.0
b                          NaN          694.0          450.0      411.0       3912.0   38859.0
c                             5.0            49.0          946.0      482.0     15476.0   23125.0
d                       6750.0        3729.0        2607.0  22111.0   262112.0   44398.0
e                   133569.0    435013.0    313833.0  28655.0   178823.0  129012.0

接下来我们需要按总出⽣数对该表进⾏规范化处理,以便计算出各性别各末字⺟占总出⽣⼈数的⽐例:
subtable.sum()          # 按列求和, 输出如下:
sex    year
F      1910      396416.0
         1960    2022062.0
         2010    1759010.0
M    1910       194198.0
        1960     2132588.0
        2010     1898382.0
dtype: float64

letter_prop = subtable / subtable.sum()         # 广播
letter_prop              # 输出如下:
sex                           F                                               M
year                    1910        1960         2010         1910         1960        2010
last_letter
a                  0.273390  0.341853  0.381240  0.005031  0.002440  0.014980
b                         NaN  0.000343  0.000256  0.002116  0.001834  0.020470
c                  0.000013  0.000024  0.000538  0.002482  0.007257  0.012181
d                  0.017028  0.001844  0.001482  0.113858  0.122908  0.023387
e                  0.336941  0.215133  0.178415  0.147556  0.083853  0.067959
...                          ...                 ...            ...                 ...           ...              ...
v                         NaN  0.000060  0.000117  0.000113  0.000037  0.001434
w                 0.000020  0.000031  0.001182  0.006329  0.007711  0.016148
x                  0.000015  0.000037  0.000727  0.003965  0.001851  0.008614
y                  0.110972  0.152569  0.116828  0.077349  0.160987  0.058168
z                  0.002439  0.000659  0.000704  0.000170  0.000184  0.001831
[26 rows x 6 columns]

有了这个字⺟⽐例数据之后,就可以⽣成⼀张各年度各性别的条形图了,如图14-8所示:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                                  legend=False)              # 输出图形14-8
plt.subplots_adjust(hspace=0.32)

数据分析项目 数据分析项目经验案例_Python_08                                   图14-8  男孩⼥孩名字中各个末字⺟的⽐例

可以看出,从20世纪60年代开始,以字⺟"n"结尾的男孩名字出现了显著的增⻓。回到之前创建的那个完整表,按年度和性别对其进⾏规范化处理,并在男孩名字中选取⼏个字⺟,最后进⾏转置以便将各个列做成⼀个时间序列:
letter_prop = table / table.sum()
dny_ts = letter_prop.loc[['d', 'n', 'y'], 'M'].T
dny_ts.head()           # 输出如下:
last_letter             d              n               y
year
1880         0.083055  0.153213  0.075760
1881         0.083247  0.153214  0.077451
1882         0.085340  0.149560  0.077537
1883         0.084066  0.151646  0.079144
1884         0.086120  0.149915  0.080405

有了这个时间序列的DataFrame之后,就可以通过其plot⽅法绘制出⼀张趋势图了(如图14-9所示):
dny_ts.plot()           # 输出图形14-9

数据分析项目 数据分析项目经验案例_数据分析项目_09                               图14-9  各年出⽣的男孩中名字以d/n/y结尾的⼈数⽐例

4、变成⼥孩名字的男孩名字(以及相反的情况)
另⼀个有趣的趋势是,早年流⾏于男孩的名字近年来“变性了”,例如Lesley或Leslie。回到top1000数据集,找出其中以"lesl"开头的⼀组名字:
all_names = pd.Series(top1000.name.unique())
lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
lesley_like             # 输出如下:
632       Leslie
2294    Lesley
4262    Leslee
4728      Lesli
6103      Lesly
dtype: object

然后利⽤这个结果过滤其他的名字,并按名字分组计算出⽣数以查看相对频率:
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()           # 输出如下:
name
Leslee       1082
Lesley     35022
Lesli            929
Leslie    370429
Lesly       10067
Name: births, dtype: int64

接下来,我们按性别和年度进⾏聚合,并按年度进⾏规范化处理:
table = filtered.pivot_table('births', index='year',
                                              columns='sex', aggfunc='sum')
table = table.div(table.sum(1), axis=0)
table.tail()            # 输出如下:
sex        F     M
year
2006  1.0 NaN
2007  1.0 NaN
2008  1.0 NaN
2009  1.0 NaN
2010  1.0 NaN

最后,就可以轻松绘制⼀张分性别的年度曲线图了(如图14-10所示):
table.plot(style={'M': 'k-', 'F': 'r--'})       # 输出图形14-10

数据分析项目 数据分析项目经验案例_Python_10                                    图14-10  各年度使⽤“Lesley型”名字的男⼥⽐例

四、USDA⻝品数据库
美国农业部(USDA)制作了⼀份有关⻝物营养信息的数据库。Ashley Williams制作了该数据的JSON版(http://ashleyw.co.uk/project/food-nutrient-database)。其中的记录如下所示:
{
             "id": 21441,
             "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA Wing, meat and skin with breading",
             "tags": ["KFC"],
             "manufacturer": "Kentucky Fried Chicken",
             "group": "Fast Foods",
             "portions": [
                 {
                         "amount": 1,
                         "unit": "wing, with skin",
                         "grams": 68.0
                 },
                 ...
             ],
             "nutrients": [
                 {
                         "value": 20.8,
                         "units": "g",
                         "description": "Protein",
                         "group": "Composition"
                 },
                 ...
             ]
}
每种⻝物都带有若⼲标识性属性以及两个有关营养成份和分量的列表。这种形式的数据不是很适合分析⼯作,因此
需要做⼀些规整化以使其具有更好⽤的形式。

从上⾯列举的⽹址下载并解压数据之后,⽤Python内置的json模块将其加载到Python中。
import json
db = json.load(open('datasets/usda_food/database.json'))
len(db)                 # 输出:6636

db中的每个条⽬都是⼀个含有某种⻝物全部数据的字典。nutrients(分量)字段是⼀个字典列表,其中的每个字典对应⼀种营养成分:
db[0].keys()            # 访问字典的键,输出如下:
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
db[0]['nutrients'][0]               # 输出如下:
{'value': 25.18,
  'units': 'g',
  'description': 'Protein',
  'group': 'Composition'}
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients[:7]
                                  description             group units    value
0                                      Protein  Composition     g      25.18
1                          Total lipid (fat)  Composition     g      29.20
2     Carbohydrate, by difference  Composition     g        3.06
3                                             Ash             Other     g        3.28
4                                        Energy           Energy  kcal   376.00
5                                         Water  Composition     g      39.28
6                                        Energy           Energy     kJ  1573.00

在将字典列表转换为DataFrame时,可以只抽取其中的⼀部分字段。这⾥,我们将取出⻝物的名称、分类、编号以及制造商等信息:
info_keys = ['description', 'group', 'id', 'manufacturer']
info = pd.DataFrame(db, columns=info_keys)
info[:5]                # 输出如下:
                                            description                             group       id      \
0                                 Cheese, caraway  Dairy and Egg Products  1008
1                                 Cheese, cheddar  Dairy and Egg Products  1009
2                                     Cheese, edam  Dairy and Egg Products  1018
3                                        Cheese, feta  Dairy and Egg Products  1019
4    Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028
       manufacturer
0
1
2
3
4
info.info()              # 输出如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
description         6636 non-null object
group                  6636 non-null object
id                        6636 non-null int64
manufacturer      5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB

通过value_counts,可以查看⻝物类别的分布情况:
pd.value_counts(info.group)[:10]                 # 输出如下:
Vegetables and Vegetable Products      812
Beef Products                                          618
Baked Products                                       496
Breakfast Cereals                                     403
Fast Foods                                               365
Legumes and Legume Products             365
Lamb, Veal, and Game Products             345
Sweets                                                     341
Pork Products                                          328
Fruits and Fruit Juices                              328
Name: group, dtype: int64

现在,为了对全部营养数据做⼀些分析,最简单的办法是将所有⻝物的营养成分整合到⼀个⼤表中。我们分⼏个步骤来实现该⽬的。⾸先,将各⻝物的营养成分列表转换为⼀个DataFrame,并添加⼀个表示编号的列,然后将该DataFrame添加到⼀个列表中。最后通过concat将这些东⻄连接起来就可以了:
  keys = ['id']
  def get_data(data, keys):          # 使用函数方式提取数据
         pieces = []
         N = len(data)
         ids = pd.DataFrame(data, columns=keys)
         for i in range(N):
               frame = pd.DataFrame(data[i]['nutrients'])
               frame['id'] = ids.iloc[i].values[0]
               pieces.append(frame)
         return pieces
pieces = get_data(db, keys)         # 调用函数
nutrients = pd.concat(pieces, ignore_index=True)            # 合并数据
nutrients               # 输出如下:
                                                       description                 group    units    value       id
0                                                           Protein      Composition         g    25.180   1008
1                                               Total lipid (fat)      Composition         g    29.200   1008
2                          Carbohydrate, by difference      Composition         g      3.060   1008
3                                                                 Ash                  Other         g      3.280   1008
4                                                            Energy                 Energy    kcal  376.000   1008
...                                                         ...                      ...           ...          ...           ...
389350                            Vitamin B-12, added              Vitamins   mcg    0.000   43546
389351                                           Cholesterol                   Other     mg    0.000   43546
389352                   Fatty acids, total saturated                   Other        g    0.072   43546
389353      Fatty acids, total monounsaturated                   Other        g    0.028   43546
389354        Fatty acids, total polyunsaturated                    Other       g    0.041   43546
[389355 rows x 5 columns]

我发现这个DataFrame中⽆论如何都会有⼀些重复项,所以直接丢弃就可以了:
# number of duplicates
nutrients.duplicated().sum()        # 输出:14179
nutrients = nutrients.drop_duplicates()

由于两个DataFrame对象中都有"group"和"description",所以为了明确到底谁是谁,我们需要对它们进⾏重命名:
col_mapping = {'description' : 'food',
                            'group'          : 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info.info()             # 输出如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
food                    6636 non-null object
fgroup                 6636 non-null object
id                         6636 non-null int64
manufacturer       5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB

col_mapping = {'description' : 'nutrient',
                            'group'          : 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients               # 输出如下:
                                                             nutrient            nutgroup    units   value       id
0                                                           Protein      Composition         g    25.180   1008
1                                               Total lipid (fat)      Composition         g    29.200   1008
2                          Carbohydrate, by difference      Composition         g      3.060   1008
3                                                                 Ash                  Other         g      3.280   1008
4                                                            Energy                 Energy    kcal  376.000   1008
...                                                         ...                      ...           ...          ...           ...
389350                            Vitamin B-12, added              Vitamins   mcg    0.000   43546
389351                                           Cholesterol                   Other     mg    0.000   43546
389352                   Fatty acids, total saturated                   Other        g    0.072   43546
389353      Fatty acids, total monounsaturated                   Other        g    0.028   43546
389354        Fatty acids, total polyunsaturated                    Other       g    0.041   43546
[389355 rows x 5 columns]

做完这些,就可以将info跟nutrients合并起来:
ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.info()            # 输出如下:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns (total 8 columns):
nutrient                375176 non-null object
nutgroup              375176 non-null object
units                     375176 non-null object
value                     375176 non-null float64
id                          375176 non-null int64
food                     375176 non-null object
fgroup                  375176 non-null object
manufacturer       293054 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 25.8+ MB

ndata.iloc[30000]       # 输出如下:
nutrient                                                                      Glycine
nutgroup                                                           Amino Acids
units                                                                                     g
value                                                                                0.04
id                                                                                    6158
food                    Soup, tomato bisque, canned, condensed
fgroup                                        Soups, Sauces, and Gravies
manufacturer
Name: 30000, dtype: object

我们现在可以根据⻝物分类和营养类型画出⼀张中位值图(如图14-11所示):
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
result['Zinc, Zn'].sort_values().plot(kind='barh')          # 输出图形14-11

数据分析项目 数据分析项目经验案例_Windows_11                                           图⽚14-11  根据营养分类得出的锌中位值

只要稍微想一想,就可以发现各营养成分最为丰富的⻝物是什么了:
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
get_maximum = lambda x: x.loc[x.value.idxmax()]
get_minimum = lambda x: x.loc[x.value.idxmin()]
max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
# make the food a little smaller
max_foods.food = max_foods.food.str[:50]

由于得到的DataFrame很⼤,所以不⽅便全部打印出来。这⾥只给出"Amino Acids"营养分组:
max_foods.loc['Amino Acids']['food']            # 输出如下:
nutrient
Alanine                                           Gelatins, dry powder, unsweetened
Arginine                                                     Seeds, sesame flour, low-fat
Aspartic acid                                                             Soy protein isolate
Cystine                               Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                                            Soy protein isolate
                                                             ...
Serine                  Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan                      Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine               Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine                   Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, Length: 19, dtype: object

五、2012联邦选举委员会数据库
美国联邦选举委员会发布了有关政治竞选赞助⽅⾯的数据。其中包括赞助者的姓名、职业、雇主、地址以及出资额等信息。我们对2012年美国总统⼤选的数据集⽐较感兴趣(http://www.fec.gov/disclosurep/PDownload.do)。2012年6⽉下载的数据集是⼀个150MB的CSV⽂件(P00000001-ALL.csv),先⽤pandas.read_csv将其加载进来:
fec = pd.read_csv('datasets/fec/P00000001-ALL.csv')
fec.info()              # 输出如下:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id                        1001731 non-null object
cand_id                        1001731 non-null object
cand_nm                      1001731 non-null object
contbr_nm                   1001731 non-null object
contbr_city                   1001712 non-null object
contbr_st                      1001727 non-null object
contbr_zip                    1001620 non-null object
contbr_employer           988002 non-null object
contbr_occupation         993301 non-null object
contb_receipt_amt       1001731 non-null float64
contb_receipt_dt          1001731 non-null object
receipt_desc                     14166 non-null object
memo_cd                         92482 non-null object
memo_text                       97770 non-null object
form_tp                        1001731 non-null object
file_num                       1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB

该DataFrame中的记录如下所示:
fec.iloc[123456]        # 输出如下:
cmte_id                       C00431445
cand_id                       P80003338
cand_nm               Obama, Barack
contbr_nm               ELLMAN, IRA
contbr_city                         TEMPE
                     ...
receipt_desc                          NaN
memo_cd                              NaN
memo_text                            NaN
form_tp                              SA17A
file_num                            772372
Name: 123456, Length: 16, dtype: object

有许多办法从这些竞选赞助数据中抽取有关赞助⼈和赞助模式的统计信息。我将在接下来的内容中介绍⼏种不同的分析⼯作(运⽤到⽬前为⽌已经学到的⽅法)。

不难看出,该数据中没有党派信息,因此最好把它加进去。通过unique,你可以获取全部的候选⼈名单:
unique_cands = fec.cand_nm.unique()
unique_cands            # 输出如下:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
           "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
           'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
           'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
           'Huntsman, Jon', 'Perry, Rick'], dtype=object)
unique_cands[2]         # 输出:'Obama, Barack'

指明党派信息的⽅法之⼀是使⽤字典:
parties = {'Bachmann, Michelle': 'Republican',
                  'Cain, Herman': 'Republican',
                  'Gingrich, Newt': 'Republican',
                  'Huntsman, Jon': 'Republican',
                  'Johnson, Gary Earl': 'Republican',
                  'McCotter, Thaddeus G': 'Republican',
                  'Obama, Barack': 'Democrat',
                  'Paul, Ron': 'Republican',
                  'Pawlenty, Timothy': 'Republican',
                  'Perry, Rick': 'Republican',
                  "Roemer, Charles E. 'Buddy' III": 'Republican',
                  'Romney, Mitt': 'Republican',
                  'Santorum, Rick': 'Republican'}
现在,通过这个映射以及Series对象的map⽅法,你可以根据候选⼈姓名得到⼀组党派信息:
fec.cand_nm[123456:123461]          # 输出如下:
123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
fec.cand_nm[123456:123461].map(parties)         # 找出对应的党派信息,输出如下:
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object

# Add it as a column,添加一列党派列

fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()         # 输出如下:
Democrat      593746
Republican    407985
Name: party, dtype: int64

这⾥有两个需要注意的地⽅。第⼀,该数据既包括赞助也包括退款(负的出资额):
(fec.contb_receipt_amt > 0).value_counts()
True     991475
False      10256
Name: contb_receipt_amt, dtype: int64

为了简化分析过程,我限定该数据集只能有正的出资额:
fec = fec[fec.contb_receipt_amt > 0]

由于Barack Obama和Mitt Romney是最主要的两名候选⼈,所以我还专⻔准备了⼀个⼦集,只包含针对他们两⼈的竞选活动的赞助信息:
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

1、根据职业和雇主统计赞助信息
基于职业的赞助信息统计是另⼀种经常被研究的统计任务。例如,律师们更倾向于资助⺠主党,⽽企业主则更倾向于资助共和党。你可以不相信我,⾃⼰看那些数据就知道了。⾸先,根据职业计算出资总额,这很简单:
fec.contbr_occupation.value_counts()[:10]                   # 输出如下:
RETIRED                                                                   233990
INFORMATION REQUESTED                                     35107
ATTORNEY                                                                34286
HOMEMAKER                                                            29931
PHYSICIAN                                                                23432
INFORMATION REQUESTED PER BEST EFFORTS      21138
ENGINEER                                                                 14334
TEACHER                                                                   13990
CONSULTANT                                                           13273
PROFESSOR                                                               12555
Name: contbr_occupation, dtype: int64

不难看出,许多职业都涉及相同的基本⼯作类型,或者同⼀样东⻄有多种变体。下⾯的代码⽚段可以清理⼀些这样的数据(将⼀个职业信息映射到另⼀个)。注意,这⾥巧妙地利⽤了dict.get,它允许没有映射关系的职业也能“通过”:
occ_mapping = {
             'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
             'INFORMATION REQUESTED' : 'NOT PROVIDED',
             'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
             'C.E.O.': 'CEO'
}
# If no mapping provided, return x
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)

对雇主信息做同样的处理:
emp_mapping = {
             'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
             'INFORMATION REQUESTED' : 'NOT PROVIDED',
             'SELF' : 'SELF-EMPLOYED',
             'SELF EMPLOYED' : 'SELF-EMPLOYED',
}
# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)

现在,你可以通过pivot_table根据党派和职业对数据进⾏聚合,然后过滤掉总出资额不⾜200万美元的数据:
by_occupation = fec.pivot_table('contb_receipt_amt',
                                                       index='contbr_occupation',
                                                       columns='party', aggfunc='sum')
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm                 # 输出如下:
party                              Democrat       Republican
contbr_occupation
ATTORNEY                 11141982.97  7.477194e+06
CEO                              2074974.79  4.211041e+06
CONSULTANT              2459912.71  2.544725e+06
ENGINEER                      951525.55  1.818374e+06
EXECUTIVE                   1355161.05  4.138850e+06
...                                             ...                     ...
PRESIDENT                   1878509.95  4.720924e+06
PROFESSOR                  2165071.08  2.967027e+05
REAL ESTATE                   528902.09  1.625902e+06
RETIRED                       25305116.38  2.356124e+07
SELF-EMPLOYED              672393.40  1.640253e+06
[17 rows x 2 columns]

把这些数据做成柱状图看起来会更加清楚('barh'表示⽔平柱状图,如图14-12所示):
over_2mm.plot(kind='barh')          # 输出图形14-12

数据分析项目 数据分析项目经验案例_数据_12                                      图14-12  对各党派总出资额最⾼的职业

你可能还想了解⼀下对Obama和Romney总出资额最⾼的职业和企业。为此,我们先对候选⼈进⾏分组,然后使⽤本章前⾯介绍的类似top的⽅法:
def get_top_amounts(group, key, n=5):
       totals = group.groupby(key)['contb_receipt_amt'].sum()
       return totals.nlargest(n)
然后根据职业和雇主进⾏聚合:
grouped = fec_mrbo.groupby('cand_nm')    # 根据被选举人分组
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)    # 输出如下:
cand_nm               contbr_occupation
Obama, Barack     RETIRED                                          25305116.38
                              ATTORNEY                                     11141982.97
                              INFORMATION REQUESTED           4866973.96
                              HOMEMAKER                                  4248875.80
                              PHYSICIAN                                       3735124.94
                                                                                               ...
Romney, Mitt       HOMEMAKER                                    8147446.22
                              ATTORNEY                                       5364718.82
                              PRESIDENT                                      2491244.89
                              EXECUTIVE                                       2300947.03
                              C.E.O.                                               1968386.11
Name: contb_receipt_amt, Length: 14, dtype: float64

grouped.apply(get_top_amounts, 'contbr_employer', n=10)     # 输出如下:
cand_nm               contbr_employer
Obama, Barack     RETIRED                                       22694358.85
                              SELF-EMPLOYED                         17080985.96
                              NOT EMPLOYED                          8586308.70
                              INFORMATION REQUESTED       5053480.37
                              HOMEMAKER                              2605408.54
                                                                                             ...
Romney, Mitt       CREDIT SUISSE                               281150.00
                              MORGAN STANLEY                       267266.00
                              GOLDMAN SACH & CO.                238250.00
                              BARCLAYS CAPITAL                       162750.00
                              H.I.G. CAPITAL                                139500.00
Name: contb_receipt_amt, Length: 20, dtype: float64

2、对出资额分组
还可以对该数据做另⼀种⾮常实⽤的分析:利⽤cut函数根据出资额的⼤⼩将数据离散化到多个⾯元中:
bins = np.array([0, 1, 10, 100, 1000, 10000,
                           100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels                  # 输出如下:
411              (10, 100]
412          (100, 1000]
413          (100, 1000]
414              (10, 100]
415              (10, 100]
              ...
701381        (10, 100]
701382    (100, 1000]
701383            (1, 10]
701384        (10, 100]
701385    (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] <
                                   (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]

现在可以根据候选⼈姓名以及⾯元标签对奥巴⻢和罗姆尼数据进⾏分组,以得到⼀个柱状图:
grouped = fec_mrbo.groupby(['cand_nm', labels])    # 先分组,再统计
grouped.size().unstack(0)
cand_nm                    Obama, Barack     Romney, Mitt
contb_receipt_amt
(0, 1]                                         493.0                     77.0
(1, 10]                                   40070.0                 3681.0
(10, 100]                             372280.0               31853.0
(100, 1000]                         153991.0               43357.0
(1000, 10000]                       22284.0               26186.0
(10000, 100000]                           2.0                       1.0
(100000, 1000000]                       3.0                     NaN
(1000000, 10000000]                   4.0                     NaN

从这个数据中可以看出,在⼩额赞助⽅⾯,Obama获得的数量⽐Romney多得多。你还可以对出资额求和并在⾯元内规格化,以便图形化显示两位候选⼈各种赞助额度的⽐例(⻅图14-13):
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums             # 输出如下:
cand_nm                        Obama, Barack    Romney, Mitt
contb_receipt_amt
(0, 1]                                        0.805182           0.194818
(1, 10]                                      0.918767           0.081233
(10, 100]                                  0.910769           0.089231
(100, 1000]                              0.710176           0.289824
(1000, 10000]                          0.447326           0.552674
(10000, 100000]                      0.823120           0.176880
(100000, 1000000]                  1.000000                   NaN
(1000000, 10000000]              1.000000                   NaN
normed_sums[:-2].plot(kind='barh')     # 输出图形14-13

数据分析项目 数据分析项目经验案例_数据分析项目_13                             图14-13  两位候选⼈收到的各种捐赠额度的总额⽐例

这里排除了两个最⼤的⾯元,因为这些不是由个⼈捐赠的。

还可以对该分析过程做许多的提炼和改进。⽐如说,可以根据赞助⼈的姓名和邮编对数据进⾏聚合,以便找出哪些⼈进⾏了多次⼩额捐款,哪些⼈⼜进⾏了⼀次或多次⼤额捐款。应该对这些数据探索⼀下。

3、根据州统计赞助信息
根据候选⼈和州对数据进⾏聚合是常规操作:
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]         # 过滤掉行的和小于100000的行
totals[:10]             # 输出如下:每一行的和都大于100000
cand_nm       Obama, Barack      Romney, Mitt
contbr_st
AK                        281840.15              86204.24
AL                        543123.48             527303.51
AR                        359247.28             105556.00
AZ                      1506476.98           1888436.23
CA                    23824984.24         11237636.60
CO                      2132429.49           1506714.12
CT                       2068291.26           3499475.45
DC                      4373538.80           1025137.50
DE                        336669.14                82712.00
FL                       7318178.58            8338458.81

如果对各⾏除以总赞助额,就会得到各候选⼈在各州的总赞助额⽐例:
percent = totals.div(totals.sum(1), axis=0)
percent[:10]            # 输出如下:
cand_nm       Obama, Barack      Romney, Mitt
contbr_st
AK                          0.765778              0.234222
AL                          0.507390              0.492610
AR                          0.772902              0.227098
AZ                          0.443745              0.556255
CA                          0.679498              0.320502
CO                          0.585970              0.414030
CT                           0.371476              0.628524
DC                          0.810113               0.189887
DE                           0.802776              0.197224
FL                            0.467417              0.532583

六、总结
pandas在数据分析工作中非常有用,希望以后的工作中多使用,平时多练习,加深印象。