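1 Baby name statistics

What can baby names tell us? Quite a lot: how many people carry a given name, how popular it is, how the population structure shifts, and more. Let's explore what the names reveal.

1.1 Loading the data

As usual, start by loading the data and printing the raw file to see what it contains, then decide how to process it and what results to aim for.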

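import pandas as pd

# Path to the 1880 file; adjust to wherever the dataset lives locally
path_1880 = r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob1880.txt'

# Peek at the first line of the raw file
with open(path_1880) as f:
    print(f.readline())

# The file has no header row, so supply the column names explicitly
names1880 = pd.read_csv(path_1880, names=['name', 'sex', 'births'])
print(type(names1880))
names1880[:5]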
Mary,F,7065

    <class 'pandas.core.frame.DataFrame'>

            name  sex  births
    0       Mary    F    7065
    1       Anna    F    2604
    2       Emma    F    2003
    3  Elizabeth    F    1939
    4     Minnie    F    1746
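
Because the file has no header row, the column names have to be supplied by the reader, which is what the names= argument does above. As a rough standard-library illustration of the same idea, the sketch below parses a few sample lines (copied from the output above) with csv.DictReader and explicit field names. The sample string and the variable names are assumptions made for illustration only.

```python
import csv
import io

# Sample lines in the same headerless "name,sex,births" layout as yob1880.txt
sample = "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n"

# fieldnames plays the role of names=[...] in pd.read_csv
reader = csv.DictReader(io.StringIO(sample), fieldnames=["name", "sex", "births"])

rows = []
for row in reader:
    row["births"] = int(row["births"])  # csv yields strings; convert the count
    rows.append(row)

for row in rows:
    print(row)
```

Without explicit field names, the first data line would be consumed as a header, which is the mistake the names= argument avoids.
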


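1.2 Births by sex in 1880

print(names1880.shape)
print(len(names1880))
print(names1880.size)  # size = rows x columns, here 2000 * 3
# Sum only the numeric births column; summing the name column is meaningless
names1880.groupby('sex')[['births']].sum()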
(2000, 3)
    2000
    6000

         births
    sex
    F     90993
    M    110493


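1.3 Merging multiple txt files (concatenating multiple DataFrames)

# The files cover 1880 through 2010; range() excludes its end value
years = range(1880, 2011)
columns = ['name', 'sex', 'births']
pieces = []
for year in years:
    path = r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year  # tag every row with the year of its source file
    pieces.append(frame)

# Stack all yearly frames into one, discarding the per-file row indexes
names = pd.concat(pieces, ignore_index=True)
print(len(names))
print(names.shape)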
1690784
    (1690784, 4)
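
The loop above tags every row with the year of the file it came from before stacking everything together. A minimal plain-Python sketch of that pattern follows, using in-memory strings as stand-ins for the yearly files. The fake_files dictionary and its contents are hypothetical toy data, not values from the real dataset.

```python
import csv
import io

# Hypothetical stand-ins for the yobYYYY.txt files: year -> file contents
fake_files = {
    1880: "Ann,F,10\nBob,M,20\n",
    1881: "Ann,F,30\n",
}

pieces = []
for year, text in sorted(fake_files.items()):
    reader = csv.reader(io.StringIO(text))
    # Build one list of records per file, adding the year to each record
    piece = [
        {"name": name, "sex": sex, "births": int(births), "year": year}
        for name, sex, births in reader
    ]
    pieces.append(piece)

# Flatten the per-file lists into one list, similar in spirit to
# pd.concat(pieces, ignore_index=True)
all_rows = [row for piece in pieces for row in piece]
print(len(all_rows))
```

The year has to be attached before merging, because once the rows are combined nothing else records which file a row came from.
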

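1.4 Counting and plotting births per year by sex

Typically, three columns of the original data are picked to build a pivot table: two of them become the row and column labels, and the third fills the cells. The result is then plotted.

total_births_by_sex = pd.pivot_table(names, values='births', index='year', columns='sex', aggfunc='sum')
total_births_by_sex.tail()  # shows the last 5 rows by default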


    sex         F        M
    year
    2006  1896468  2050234
    2007  1916888  2069242
    2008  1883645  2032310
    2009  1827643  1973359
    2010  1759010  1898382
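
To make the pivot step more concrete, here is a plain-Python sketch of what a pivot with a sum aggregation computes: one row key, one column key, and a summed value per cell. The records are toy values chosen for illustration, and the nested-dictionary layout is only an assumption about how one might model the table without pandas.

```python
from collections import defaultdict

# Toy records: (year, sex, births); values are illustrative only
records = [
    (1880, "F", 10),
    (1880, "F", 5),
    (1880, "M", 20),
    (1881, "F", 7),
    (1881, "M", 8),
    (1881, "M", 2),
]

# pivot[year][sex] accumulates the sum of births for that cell
pivot = defaultdict(lambda: defaultdict(int))
for year, sex, births in records:
    pivot[year][sex] += births

for year in sorted(pivot):
    print(year, dict(pivot[year]))
```

Here year plays the role of index, sex the role of columns, and births the role of values in the pivot_table call above.
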


import matplotlib.pyplot as plt

total_births_by_sex.plot(title='total births by sex and year')
plt.show()



[Figure: total births by sex and year]

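1.5 Finding the most popular names

Before we can do that, we need to add a prop column to the names dataset: each name's share of that year's births among babies of the same sex.

# transform('sum') broadcasts each (year, sex) group's total back onto its rows,
# so dividing gives every name's share within its own group
group_totals = names.groupby(['year', 'sex']).births.transform('sum')
names['prop'] = names.births / group_totals
names[:5]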


            name  sex  births  year      prop
    0       Mary    F    7065  1880  0.077643
    1       Anna    F    2604  1880  0.028618
    2       Emma    F    2003  1880  0.022013
    3  Elizabeth    F    1939  1880  0.021309
    4     Minnie    F    1746  1880  0.019188
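
The prop column is a within-group share: each row's births divided by the total births of its own (year, sex) group. The following plain-Python sketch spells out that two-pass computation on toy rows. The names and counts are made up for illustration.

```python
import math
from collections import defaultdict

# Toy rows; the birth counts are illustrative only
rows = [
    {"name": "Ann", "sex": "F", "births": 60, "year": 1880},
    {"name": "Eve", "sex": "F", "births": 40, "year": 1880},
    {"name": "Bob", "sex": "M", "births": 30, "year": 1880},
    {"name": "Tom", "sex": "M", "births": 10, "year": 1880},
]

# First pass: total births per (year, sex) group
totals = defaultdict(int)
for row in rows:
    totals[(row["year"], row["sex"])] += row["births"]

# Second pass: each row's share of its own group's total
for row in rows:
    row["prop"] = row["births"] / totals[(row["year"], row["sex"])]

# Within every group the shares should add up to 1
group_sums = defaultdict(float)
for row in rows:
    group_sums[(row["year"], row["sex"])] += row["prop"]
print(dict(group_sums))
```

The same arithmetic explains the table above: Mary's 1880 value is 7065 divided by the 90993 girls born that year, which is roughly 0.077643.
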


import numpy as np

# Sanity check: within every (year, sex) group the prop values should sum to about 1
np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)
True

def get_top1000(group):
    # sort_index(by=...) is deprecated; sort_values is its replacement.
    # Note: this only sorts each group. It does not cut the group down to
    # 1000 rows (that would need a trailing [:1000]), so every name is kept.
    return group.sort_values(by='births', ascending=False)

top1000 = names.groupby(['year', 'sex']).apply(get_top1000)
top1000.info()  # summary of top1000
<class 'pandas.core.frame.DataFrame'>
    MultiIndex: 1690784 entries, (1880, F, 0) to (2010, M, 1690783)
    Data columns (total 5 columns):
    name      1690784 non-null object
    sex       1690784 non-null object
    births    1690784 non-null int64
    year      1690784 non-null int64
    prop      1690784 non-null float64
    dtypes: float64(1), int64(2), object(2)
    memory usage: 77.4+ MB

Show the most popular names:

top1000[:5]


                     name  sex  births  year      prop
    year sex
    1880 F   0       Mary    F    7065  1880  0.077643
             1       Anna    F    2604  1880  0.028618
             2       Emma    F    2003  1880  0.022013
             3  Elizabeth    F    1939  1880  0.021309
             4     Minnie    F    1746  1880  0.019188


In 1880, Mary was the most common name for baby girls.
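
Note that get_top1000 as written above sorts each group but does not cut it down to 1000 rows, which is why top1000.info() still reports 1690784 entries. The plain-Python sketch below contrasts sorting alone with sorting plus truncation on toy data. The helper top_n and the sample rows are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Toy rows: (year, sex, name, births); values are illustrative only
rows = [
    (1880, "F", "Ann", 5),
    (1880, "F", "Eve", 9),
    (1880, "F", "Ida", 7),
    (1880, "M", "Bob", 4),
    (1880, "M", "Tom", 8),
]

def top_n(rows, n=None):
    """Sort each (year, sex) group by births, descending; keep n rows if n is given."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[1])].append(row)
    result = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda r: r[3], reverse=True)
        result.extend(ordered if n is None else ordered[:n])
    return result

sorted_only = top_n(rows)    # same number of rows as the input
top_two = top_n(rows, n=2)   # at most 2 rows per group

print(len(sorted_only), len(top_two))
```

Only the truncated version discards the less popular names, so only it can make the summed share of the kept names fall below 1.
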

1.6 Analyzing and plotting a name's trend over time

# Split top1000 into boys and girls
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

boys[:5]


                     name  sex  births  year      prop
    year sex
    1880 M   942     John    M    9655  1880  0.087381
             943  William    M    9533  1880  0.086277
             944    James    M    5927  1880  0.053641
             945  Charles    M    5348  1880  0.048401
             946   George    M    5126  1880  0.046392


girls[:5]


                     name  sex  births  year      prop
    year sex
    1880 F   0       Mary    F    7065  1880  0.077643
             1       Anna    F    2604  1880  0.028618
             2       Emma    F    2003  1880  0.022013
             3  Elizabeth    F    1939  1880  0.021309
             4     Minnie    F    1746  1880  0.019188


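# Pivot table of births with year as rows and name as columns.
# top1000 carries year and sex both as index levels and as columns, which
# recent pandas versions reject as ambiguous, so drop the index first.
total_births = top1000.reset_index(drop=True).pivot_table(values='births', index='year', columns='name', aggfunc='sum')
total_births[:5]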


name  Aaban  Aabid  Aabriella  Aadam  Aadan  Aadarsh  Aaden  Aadesh  Aadhav  Aadhavan  ...  Zyrus  Zysean  Zyshaun  Zyshawn  Zyshon  Zyshonne  Zytavious  Zyvion  Zyyanna  Zzyzx
year
1880    NaN    NaN        NaN    NaN    NaN      NaN    NaN     NaN     NaN       NaN  ...    NaN     NaN      NaN      NaN     NaN       NaN        NaN     NaN      NaN    NaN
1881    NaN    NaN        NaN    NaN    NaN      NaN    NaN     NaN     NaN       NaN  ...    NaN     NaN      NaN      NaN     NaN       NaN        NaN     NaN      NaN    NaN
1882    NaN    NaN        NaN    NaN    NaN      NaN    NaN     NaN     NaN       NaN  ...    NaN     NaN      NaN      NaN     NaN       NaN        NaN     NaN      NaN    NaN
1883    NaN    NaN        NaN    NaN    NaN      NaN    NaN     NaN     NaN       NaN  ...    NaN     NaN      NaN      NaN     NaN       NaN        NaN     NaN      NaN    NaN
1884    NaN    NaN        NaN    NaN    NaN      NaN    NaN     NaN     NaN       NaN  ...    NaN     NaN      NaN      NaN     NaN       NaN        NaN     NaN      NaN    NaN

5 rows × 88496 columns


total_births.info()
<class 'pandas.core.frame.DataFrame'>
    Int64Index: 131 entries, 1880 to 2010
    Columns: 88496 entries, Aaban to Zzyzx
    dtypes: float64(88496)
    memory usage: 88.4 MB
subset = total_births['John']
subset[:5]
year
    1880    9701.0
    1881    8795.0
    1882    9597.0
    1883    8934.0
    1884    9427.0
    Name: John, dtype: float64
subset = total_births[['John','Harry','Mary','Marilyn']]
subset[:5]


    name    John   Harry    Mary  Marilyn
    year
    1880  9701.0  2158.0  7092.0      NaN
    1881  8795.0  2002.0  6948.0      NaN
    1882  9597.0  2246.0  8179.0      NaN
    1883  8934.0  2116.0  8044.0      NaN
    1884  9427.0  2338.0  9253.0      NaN


# Plot subset (note: this figure came from the third run; the first two runs had minor problems)
subset.plot(subplots = True, figsize = (12,10),grid = False,title='Number of births per year')
plt.show()



[Figure: Number of births per year]

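Why are common names used less and less? One possible reason is that parents want their children's names to stand out.

1.7 Growing diversity in naming

How can we verify that naming diversity is increasing? top1000 above is meant to hold the most popular names for each year and sex, and prop is each name's share of that year's births for its sex. If the total prop covered by those names falls over time, the share held by rarer names must be rising, which shows that names are becoming more varied.

# Drop the index first: year and sex exist both as index levels and as columns,
# which recent pandas versions reject as ambiguous
tabal = top1000.reset_index(drop=True).pivot_table(values='prop', index='year', columns='sex', aggfunc='sum')
tabal[:5]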


    sex     F    M
    year
    1880  1.0  1.0
    1881  1.0  1.0
    1882  1.0  1.0
    1883  1.0  1.0
    1884  1.0  1.0


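For 1880-1884 the sums are 1.0, which would suggest that essentially all babies had names from the top-1000 set, i.e. little naming diversity.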

#tabal.plot(title ='sum of table1000.prop by year and sex',yticks = np.linspace(0,1.2,20),xticks=range(1880,2020,10))
tabal.plot(title ='sum of table1000.prop by year and sex')
plt.show()



[Figure: sum of table1000.prop by year and sex]

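An unsolved mystery? If the plot stays flat at 1.0 instead of declining, the likely explanation is that get_top1000 only sorts each group and never truncates it to 1000 rows. top1000 then still contains every name (top1000.info() reports all 1690784 entries), so the prop values of each year and sex necessarily sum to 1.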

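Another way to show naming diversity

Take the names in order of popularity and count how many are needed to cover 50% of births. If that count keeps growing over time, naming diversity is increasing. So for each year and sex we compute the number of top names that together account for 50% of births, and then look at the trend.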

def get_quantile_count(group, q=0.5):
    # Sort by share, largest first, then find how many names it takes
    # for the cumulative share to reach q
    group = group.sort_values(by='prop', ascending=False)
    # searchsorted on the NumPy array returns a scalar position; +1 makes it a count
    return group.prop.cumsum().values.searchsorted(q) + 1

# Drop the index first so that year and sex are unambiguous column names
diversity = top1000.reset_index(drop=True).groupby(['year', 'sex']).apply(get_quantile_count)
print(type(diversity))
diversity[:5]


    <class 'pandas.core.series.Series'>
year  sex
1880  F      38
      M      14
1881  F      38
      M      14
1882  F      38
dtype: int64
# A Series with a MultiIndex can be unstacked into a DataFrame
diversity = diversity.unstack('sex')
diversity[:5]


    sex    F   M
    year
    1880  38  14
    1881  38  14
    1882  38  15
    1883  39  15
    1884  39  16


diversity.plot(title='number of popular names in top 50%')
plt.show()



[Figure: number of popular names in top 50%]

The curves trend upward, which indicates that naming diversity is increasing.
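
The counting step behind get_quantile_count can be illustrated without pandas: sort the shares in descending order, accumulate them, and find the first position where the running total reaches q. The sketch below uses bisect_left, which matches the default left-side behaviour of searchsorted. The function name quantile_count and the birth counts are hypothetical toy inputs.

```python
from bisect import bisect_left
from itertools import accumulate

def quantile_count(births, q=0.5):
    """Number of most popular names needed for their combined share to reach q."""
    total = sum(births)
    # Shares sorted from most to least popular
    props = sorted((b / total for b in births), reverse=True)
    cumulative = list(accumulate(props))
    # Index of the first cumulative share >= q, plus 1 to turn it into a count
    return bisect_left(cumulative, q) + 1

print(quantile_count([40, 25, 20, 10, 5]))   # concentrated toy distribution
print(quantile_count([20, 20, 20, 20, 20]))  # evenly spread toy distribution
```

A more evenly spread distribution needs more names to reach the same share, which is why a rising count signals growing diversity.
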

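1.8 Names shared between boys and girls

Names that used to be given mostly to boys are now increasingly given to girls. Here we look only at the group of names starting with 'lesl' and at how their use for boys and girls changes over time.

Find the names in top1000 that start with lesl

all_names = top1000.name.unique()  # the set of distinct names in top1000 (no duplicates)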
all_names[:5]
array(['Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'], dtype=object)
mask = np.array(['lesl' in x.lower() for x in all_names])
mask
array([False, False, False, ..., False, False, False], dtype=bool)
# Names that contain 'lesl' anywhere
lesley_like = all_names[mask]
print(lesley_like.shape)
# Names that start with 'lesl'
lesl_like = np.array([x for x in all_names if x.lower().startswith('lesl')])
print(lesl_like.shape)
(24,)
    (21,)
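
The two counts differ because a substring test also matches names that contain 'lesl' somewhere other than the start. A small sketch of the difference follows. The sample list is illustrative, and 'Wellesley' is a made-up entry used only to show a non-prefix match.

```python
# Illustrative names; 'Wellesley' is a made-up sample showing a non-prefix match
sample_names = ["Leslie", "Lesley", "Lesly", "Wellesley", "Mary"]

# Substring test: matches 'lesl' anywhere in the lowercased name
contains_lesl = [x for x in sample_names if "lesl" in x.lower()]

# Prefix test: matches only names that begin with 'lesl'
starts_with_lesl = [x for x in sample_names if x.lower().startswith("lesl")]

print(contains_lesl)
print(starts_with_lesl)
```
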
lesl_top1000 = top1000[top1000.name.isin(lesl_like)]
lesl_top1000.groupby(['name']).births.sum()
name
    Lesle            187
    Leslea           349
    Leslee          4863
    Leslei            52
    Lesleigh         436
    Lesley         37945
    Lesleyann         86
    Lesleyanne        80
    Lesli           5473
    Leslian           27
    Lesliann           6
    Leslianne         10
    Leslie        371686
    Leslieann        465
    Leslieanne        93
    Lesliee            8
    Leslly             5
    Lesly          12407
    Leslyann          16
    Leslye          2295
    Leslyn           166
    Name: births, dtype: int64
lesl_top1000[:5]


                     name  sex  births  year      prop
    year sex
    1880 F   654   Leslie    F       8  1880  0.000088
         M   1108  Leslie    M      79  1880  0.000715
    1881 F   2523  Leslie    F      11  1881  0.000120
         M   3072  Leslie    M      92  1881  0.000913
    1882 F   4593  Leslie    F       9  1882  0.000083


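Build a pivot table over year, sex, and births, then plot the yearly number of births by sex for names in the lesl_like set.

# Drop the index first so that year and sex are unambiguous column names
lesl_pivot = lesl_top1000.reset_index(drop=True).pivot_table(values='births', index='year', columns='sex', aggfunc='sum')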
lesl_pivot[:5]


    sex    F    M
    year
    1880   8   79
    1881  11   92
    1882   9  128
    1883   7  125
    1884  15  125


lesl_pivot.plot(style = {'M':'k-','F':'k--'})
plt.show()



[Figure: births per year by sex for names in lesl_like]

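The figure above shows how the number of births with names in lesl_like changes, but it does not account for the yearly totals changing, so the shift of a boys' name toward girls is not clearly visible. Next, let's normalize each year by that year's total for these names and see how the result looks.

# Divide every row by its own row total so that F and M sum to 1 in each year
lesl_pivot_div = lesl_pivot.div(lesl_pivot.sum(axis=1), axis=0)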
lesl_pivot_div.tail()


    sex          F         M
    year
    2006  0.979139  0.020861
    2007  0.978508  0.021492
    2008  0.977437  0.022563
    2009  0.971627  0.028373
    2010  0.978482  0.021518
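
The normalization divides each year's counts by that year's row total, so the two sexes always add up to 1. A plain-Python sketch of the same row-wise division, using the first rows of lesl_pivot shown earlier (the dictionary layout is an assumption made for illustration):

```python
import math

# Yearly counts for Lesl-like names by sex, from the first rows of lesl_pivot
lesl_counts = {
    1880: {"F": 8, "M": 79},
    1881: {"F": 11, "M": 92},
    1882: {"F": 9, "M": 128},
}

# Divide each row by its own total so the shares in every year sum to 1
lesl_shares = {}
for year, by_sex in lesl_counts.items():
    row_total = sum(by_sex.values())
    lesl_shares[year] = {sex: count / row_total for sex, count in by_sex.items()}

for year, shares in lesl_shares.items():
    print(year, {sex: round(share, 4) for sex, share in shares.items()})
```

In these early years the male share is above 0.9, the mirror image of the 2006-2010 rows in the table above.
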


lesl_pivot_div.plot(style = {'M':'k-','F':'k--'})
plt.show()



[Figure: share of births by sex per year for names in lesl_like]

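Look! The result is clear: between 1880 and 1940 most babies with names in lesl_like were boys, but after that the share of girls changed dramatically.

2 Summary

This part mainly used pandas functions such as groupby, pivot_table, and .plot to analyze how the use of baby names has changed over time.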