1 统计婴儿姓名
婴儿的姓名,能反映什么呢?很多,比如某个名字的使用人数,流行程度,人口结构变化等等。下边就让我们来探索名字中隐藏的秘密吧~
1.1 下载数据
仍然是最常规的下载数据,然后显示原数据,看看原始数据文件里边都是什么样的,然后再想怎么处理,要得到什么的结果。
import pandas as pd
#查看原始数据
print open(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob1880.txt').readline()
names1880 = pd.read_csv(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob1880.txt',names=['name','sex','births'])
print type(names1880)
names1880[:5]
Mary,F,7065
<class 'pandas.core.frame.DataFrame'>
name | sex | births | |
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
1.2 查看1880年男女婴儿的出生数
print names1880.shape
print len(names1880)
print names1880.size#很显然,size是三列的乘积
names1880.groupby(['sex']).sum()
(2000, 3)
2000
6000
births | |
sex | |
F | 90993 |
M | 110493 |
1.3 实现多个txt文本文件的融合(多个DataFrame的联结)
#由于统计的是1880-2011年的婴儿名字
years = range(1880,2011)
columns = ['name','sex','births']
pieces = []
for year in years:
path = r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob%d.txt'%year
frame = pd.read_csv(path, names = columns)
frame['year'] = year
pieces.append(frame)
names = pd.concat(pieces,ignore_index=True)
print len(names)
print names.shape
1690784
(1690784, 4)
1.4 统计并可视化每年不同性别婴儿的出生数量
一般原来数据的三列会被挑选出来,做成透视表,其中两列做成行列表,第三列填充表中内容,并实现可视化
total_births_by_sex = pd.pivot_table(names,values = 'births', index ='year',columns='sex',aggfunc = sum)
total_births_by_sex.tail()#默认显示最后5行
sex | F | M |
year | ||
2006 | 1896468 | 2050234 |
2007 | 1916888 | 2069242 |
2008 | 1883645 | 2032310 |
2009 | 1827643 | 1973359 |
2010 | 1759010 | 1898382 |
total_births_by_sex.plot(title='total births by sex and year')
import matplotlib.pyplot as plt
plt.show()
1.5 找出最受欢迎的名字
在达到要求之前,需要在原来的names数据集增加一列prop表示每年每个名字在当年相同性别中的使用比例
def add_prop(group):
briths = group.births.astype(float)
group['prop'] = briths/briths.sum()
return group
names = names.groupby(['year','sex']).apply(add_prop)
names[:5]
name | sex | births | year | prop | |
0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 |
2 | Emma | F | 2003 | 1880 | 0.022013 |
3 | Elizabeth | F | 1939 | 1880 | 0.021309 |
4 | Minnie | F | 1746 | 1880 | 0.019188 |
import numpy as np
#检验分组总计值是否接近于1
np.allclose(names.groupby(['year','sex']).prop.sum(),1)
True
def get_top1000(group):
return group.sort_index(by = 'births',ascending = False)
top1000 = names.groupby(['year','sex']).apply(get_top1000)
C:\Program Files\anaconda\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
from ipykernel import kernelapp as app
top1000.info()#查看top1000的相关信息
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1690784 entries, (1880, F, 0) to (2010, M, 1690783)
Data columns (total 5 columns):
name 1690784 non-null object
sex 1690784 non-null object
births 1690784 non-null int64
year 1690784 non-null int64
prop 1690784 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 77.4+ MB
展示最受欢迎的名字
top1000[:5]
name | sex | births | year | prop | |||
year | sex | ||||||
1880 | F | 0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 | ||
2 | Emma | F | 2003 | 1880 | 0.022013 | ||
3 | Elizabeth | F | 1939 | 1880 | 0.021309 | ||
4 | Minnie | F | 1746 | 1880 | 0.019188 |
可以发现在1880年名字为mary的女baby最多
1.6 分析并可视化某个名字的随时间变化趋势
#将top1000中男女分开
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
boys[:5]
name | sex | births | year | prop | |||
year | sex | ||||||
1880 | M | 942 | John | M | 9655 | 1880 | 0.087381 |
943 | William | M | 9533 | 1880 | 0.086277 | ||
944 | James | M | 5927 | 1880 | 0.053641 | ||
945 | Charles | M | 5348 | 1880 | 0.048401 | ||
946 | George | M | 5126 | 1880 | 0.046392 |
girls[:5]
name | sex | births | year | prop | |||
year | sex | ||||||
1880 | F | 0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 | ||
2 | Emma | F | 2003 | 1880 | 0.022013 | ||
3 | Elizabeth | F | 1939 | 1880 | 0.021309 | ||
4 | Minnie | F | 1746 | 1880 | 0.019188 |
#生成year、name和births的透视表
total_births = top1000.pivot_table(values = 'births',index = 'year',columns = 'name',aggfunc = sum)
total_births[:5]
name | Aaban | Aabid | Aabriella | Aadam | Aadan | Aadarsh | Aaden | Aadesh | Aadhav | Aadhavan | … | Zyrus | Zysean | Zyshaun | Zyshawn | Zyshon | Zyshonne | Zytavious | Zyvion | Zyyanna | Zzyzx |
year | |||||||||||||||||||||
1880 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1881 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1882 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1883 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1884 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 88496 columns
total_births.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 88496 entries, Aaban to Zzyzx
dtypes: float64(88496)
memory usage: 88.4 MB
subset = total_births['John']
subset[:5]
year
1880 9701.0
1881 8795.0
1882 9597.0
1883 8934.0
1884 9427.0
Name: John, dtype: float64
subset = total_births[['John','Harry','Mary','Marilyn']]
subset[:5]
name | John | Harry | Mary | Marilyn |
year | ||||
1880 | 9701.0 | 2158.0 | 7092.0 | NaN |
1881 | 8795.0 | 2002.0 | 6948.0 | NaN |
1882 | 9597.0 | 2246.0 | 8179.0 | NaN |
1883 | 8934.0 | 2116.0 | 8044.0 | NaN |
1884 | 9427.0 | 2338.0 | 9253.0 | NaN |
#可视化subset,注意该图是在运行三次出的,前两次有点小问题
subset.plot(subplots = True, figsize = (12,10),grid = False,title='Number of births per year')
plt.show()
为什么常见的名字会越来越少被使用?可能的原因是大家想让孩子的名字与众不同~
1.7 命名多样性增加
怎么来验证命名多样性的增加呢?在上述中top1000是按前prop选出的,prop表示某个名字每性别每年在所有婴儿出生数中的比例,当这个值降低时,说明其他人数少的名字的比例会增加,即可证明名字越来越不同。
tabal = top1000.pivot_table(values = 'prop',index = 'year',columns = 'sex',aggfunc = sum)
tabal[:5]
sex | F | M |
year | ||
1880 | 1.0 | 1.0 |
1881 | 1.0 | 1.0 |
1882 | 1.0 | 1.0 |
1883 | 1.0 | 1.0 |
1884 | 1.0 | 1.0 |
可以看出1880-1884年婴儿的名字几乎都是前1000名的名字集合中,命名多样性差
#tabal.plot(title ='sum of table1000.prop by year and sex',yticks = np.linspace(0,1.2,20),xticks=range(1880,2020,10))
tabal.plot(title ='sum of table1000.prop by year and sex')
plt.show()
未解之谜???
另外一种展示名字多样性的方式
占出生总人口数50%的且具有较大prop的名字的个数随时间越来越多,那么也可以说明命名多样性在增加。那么分别统计每年不同性别出生数占50%的名字个数的变化趋势,从该趋势上就能看出命名多样性。
def get_quantile_count(group,q=0.5):
group = group.sort_index(by = 'prop',ascending = False)
return (group.prop.cumsum().searchsorted(q)+1)[0]
diversity = top1000.groupby(['year','sex']).apply(get_quantile_count)
print type(diversity)
diversity[:5]
C:\Program Files\anaconda\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
from ipykernel import kernelapp as app
<class 'pandas.core.series.Series'>
year sex
1880 F 38
M 14
1881 F 38
M 14
1882 F 38
dtype: int64
#具有多个索引的Serise可以展开
diversity = diversity.unstack('sex')
diversity[:5]
sex | F | M |
year | ||
1880 | 38 | 14 |
1881 | 38 | 14 |
1882 | 38 | 15 |
1883 | 39 | 15 |
1884 | 39 | 16 |
diversity.plot(title='number of popular names in top 50%')
plt.show()
可以看出该曲线呈现上升趋势,说明命名多样性。
1.8 男孩名与女孩名字的混用情况
在以前,通常男孩使用的名字,在现在,被越来越多的使用在女孩上。在这里,只考察以‘lesl’开头的一组名字,男生女生使用随时间的变化趋势。
找出top1000表中以lesl开通的名字
all_names = top1000.name.unique()#找出top1000中所有名字的集合(没有重复的名字)
all_names[:5]
array(['Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'], dtype=object)
mask = np.array(['lesl' in x.lower() for x in all_names])
mask
array([False, False, False, ..., False, False, False], dtype=bool)
lesley_like = all_names[mask]
print lesley_like.shape
lesl_like = np.array([ x for x in all_names if x.lower().startswith('lesl')])
print lesl_like.shape
(24L,)
(21L,)
lesl_top1000 = top1000[top1000.name.isin(lesl_like)]
lesl_top1000.groupby(['name']).births.sum()
name
Lesle 187
Leslea 349
Leslee 4863
Leslei 52
Lesleigh 436
Lesley 37945
Lesleyann 86
Lesleyanne 80
Lesli 5473
Leslian 27
Lesliann 6
Leslianne 10
Leslie 371686
Leslieann 465
Leslieanne 93
Lesliee 8
Leslly 5
Lesly 12407
Leslyann 16
Leslye 2295
Leslyn 166
Name: births, dtype: int64
lesl_top1000[:5]
name | sex | births | year | prop | |||
year | sex | ||||||
1880 | F | 654 | Leslie | F | 8 | 1880 | 0.000088 |
M | 1108 | Leslie | M | 79 | 1880 | 0.000715 | |
1881 | F | 2523 | Leslie | F | 11 | 1881 | 0.000120 |
M | 3072 | Leslie | M | 92 | 1881 | 0.000913 | |
1882 | F | 4593 | Leslie | F | 9 | 1882 | 0.000083 |
构造year、sex、以及births的透视图,并画出每年不同性别,名字在lesl_like集合中的出生数
lesl_pivot = lesl_top1000.pivot_table(values = 'births',index = 'year',columns = 'sex',aggfunc = sum)
lesl_pivot[:5]
sex | F | M |
year | ||
1880 | 8 | 79 |
1881 | 11 | 92 |
1882 | 9 | 128 |
1883 | 7 | 125 |
1884 | 15 | 125 |
lesl_pivot.plot(style = {'M':'k-','F':'k--'})
plt.show()
上图显示命名在lesl_like集合中的婴儿出生数量的变化,但是我们没有考虑,每年男女baby出生总量的变化因素,使得结果并不是很清晰的显示男名女用的变化趋势,接下里,将男女baby出生总量的变化因素考虑进去,看看结果如何吧~
lesl_pivot_div = lesl_pivot.div(lesl_pivot.sum(1),axis = 0)
lesl_pivot_div.tail()
sex | F | M |
year | ||
2006 | 0.979139 | 0.020861 |
2007 | 0.978508 | 0.021492 |
2008 | 0.977437 | 0.022563 |
2009 | 0.971627 | 0.028373 |
2010 | 0.978482 | 0.021518 |
lesl_pivot_div.plot(style = {'M':'k-','F':'k--'})
plt.show()
瞧!结果很明显,1880-1940年之间lesl_like集合中的名字大部分是男孩,但是之后,女孩使用的比例发生了很大变化!
2 总结
本部分主要使用panda中的groupby、pivot_table、.plot灯等函数分析了婴儿名字的使用变化情况。