Pandas
的groupby()
功能很强大,用好了可以方便的解决很多问题,在数据处理以及日常工作中经常能施展拳脚。
使用Pandas
实现分组聚合需要分两步走。
第一步是指定分组变量,可以通过数据框的groupy()
完成;
第二步是对不同的数值变量计算各自的统计值。
1. groupby的基础操作
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 8, 1, 4, 3, 2, 5, 9],
'C': [102, 98, 107, 104, 115, 87, 92, 123]})
按A列分组(groupby),获取其他列的均值
df.groupby('A').mean()
|
B |
C |
---|
A |
|
|
---|
a |
2.0 |
108.000000 |
---|
b |
6.5 |
95.000000 |
---|
c |
5.0 |
104.666667 |
---|
按多列进行分组(groupby)
df.groupby(['A','B']).mean()
|
|
C |
---|
A |
B |
|
---|
a |
1 |
107 |
---|
2 |
102 |
---|
3 |
115 |
---|
b |
5 |
92 |
---|
8 |
98 |
---|
c |
2 |
87 |
---|
4 |
104 |
---|
9 |
123 |
---|
分组后选择列进行运算
分组后,可以选取单列数据,或者多个列组成的列表(list)进行运算
df = pd.DataFrame([[1, 1, 2], [1, 2, 3], [2, 3, 4]], columns=["A", "B", "C"])
df
|
A |
B |
C |
---|
0 |
1 |
1 |
2 |
---|
1 |
1 |
2 |
3 |
---|
2 |
2 |
3 |
4 |
---|
g = df.groupby("A")
g['B'].mean()
A
1 1.5
2 3.0
Name: B, dtype: float64
g[['B', 'C']].mean()
|
B |
C |
---|
A |
|
|
---|
1 |
1.5 |
2.5 |
---|
2 |
3.0 |
4.0 |
---|
可以针对不同的列选用不同的聚合方法
g.agg({'B':'mean', 'C':'sum'})
2. 聚合方法size()和count()
size
跟count
的区别: size
计数时包含NaN值,而count
不包含NaN值
df = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
"Val":[4,3,3,np.nan,np.nan,4]})
df
|
Name |
City |
Val |
---|
0 |
Alice |
Seattle |
4.0 |
---|
1 |
Bob |
Seattle |
3.0 |
---|
2 |
Mallory |
Portland |
3.0 |
---|
3 |
Mallory |
Seattle |
NaN |
---|
4 |
Bob |
Seattle |
NaN |
---|
5 |
Mallory |
Portland |
4.0 |
---|
count()
df.groupby(["Name", "City"], as_index=False)['Val'].count()
|
Name |
City |
Val |
---|
0 |
Alice |
Seattle |
1 |
---|
1 |
Bob |
Seattle |
1 |
---|
2 |
Mallory |
Portland |
2 |
---|
3 |
Mallory |
Seattle |
0 |
---|
size()
df.groupby(["Name", "City"])['Val'].size().reset_index(name='Size')
|
Name |
City |
Size |
---|
0 |
Alice |
Seattle |
1 |
---|
1 |
Bob |
Seattle |
2 |
---|
2 |
Mallory |
Portland |
2 |
---|
3 |
Mallory |
Seattle |
1 |
---|
3. 分组运算方法 agg()
针对某列使用agg()
时进行不同的统计运算
df = pd.DataFrame({'A': list('XYZXYZXYZX'),
'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2],
'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})
df
|
A |
B |
C |
---|
0 |
X |
1 |
12 |
---|
1 |
Y |
2 |
14 |
---|
2 |
Z |
1 |
11 |
---|
3 |
X |
3 |
12 |
---|
4 |
Y |
1 |
13 |
---|
5 |
Z |
2 |
14 |
---|
6 |
X |
3 |
16 |
---|
7 |
Y |
3 |
12 |
---|
8 |
Z |
1 |
10 |
---|
9 |
X |
2 |
19 |
---|
df.groupby('A')['B'].agg({'mean':np.mean, 'standard deviation': np.std})
e:\Anaconda3.5\envs\ccf3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use named aggregation instead.
>>> grouper.agg(name_1=func_1, name_2=func_2)
"""Entry point for launching an IPython kernel.
|
mean |
standard deviation |
---|
A |
|
|
---|
X |
2.250000 |
0.957427 |
---|
Y |
2.000000 |
1.000000 |
---|
Z |
1.333333 |
0.577350 |
---|
针对不同的列应用多种不同的统计方法
df.groupby('A').agg({'B':[np.mean, 'sum'], 'C':['count',np.std]})
|
B |
C |
---|
|
mean |
sum |
count |
std |
---|
A |
|
|
|
|
---|
X |
2.250000 |
9 |
4 |
3.403430 |
---|
Y |
2.000000 |
6 |
3 |
1.000000 |
---|
Z |
1.333333 |
4 |
3 |
2.081666 |
---|
4. 分组运算方法 apply()
df = pd.DataFrame({'A': list('XYZXYZXYZX'),
'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2],
'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})
df
|
A |
B |
C |
---|
0 |
X |
1 |
12 |
---|
1 |
Y |
2 |
14 |
---|
2 |
Z |
1 |
11 |
---|
3 |
X |
3 |
12 |
---|
4 |
Y |
1 |
13 |
---|
5 |
Z |
2 |
14 |
---|
6 |
X |
3 |
16 |
---|
7 |
Y |
3 |
12 |
---|
8 |
Z |
1 |
10 |
---|
9 |
X |
2 |
19 |
---|
df.groupby('A').apply(np.mean)
|
B |
C |
---|
A |
|
|
---|
X |
2.250000 |
14.750000 |
---|
Y |
2.000000 |
13.000000 |
---|
Z |
1.333333 |
11.666667 |
---|
apply()
方法可以应用lambda
函数,举例如下:
df.groupby('A').apply(lambda x: x['C']-x['B'])
A
X 0 11
3 9
6 13
9 17
Y 1 12
4 12
7 9
Z 2 10
5 12
8 9
dtype: int64
df.groupby('A').apply(lambda x: (x['C']-x['B']).mean())
A
X 12.500000
Y 11.000000
Z 10.333333
dtype: float64
5. 分组运算方法 transform()
前面进行聚合运算的时候,得到的结果是一个以分组名为index
的结果对象。如果我们想使用原数组的index
的话,就需要进行 merge
转换。transform(func, args, *kwargs)
方法简化了这个过程,它会把 func
参数应用到所有分组,然后把结果放置到原数组的 index
上(如果结果是一个标量,就进行广播)
df = pd.DataFrame({'group1' : ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'group2' : ['C', 'C', 'C', 'D', 'E', 'E', 'F', 'F'],
'B' : ['one', np.NaN, np.NaN, np.NaN, np.NaN, 'two', np.NaN, np.NaN],
'C' : [np.NaN, 1, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, 4]})
df
|
group1 |
group2 |
B |
C |
---|
0 |
A |
C |
one |
NaN |
---|
1 |
A |
C |
NaN |
1.0 |
---|
2 |
A |
C |
NaN |
NaN |
---|
3 |
A |
D |
NaN |
NaN |
---|
4 |
B |
E |
NaN |
NaN |
---|
5 |
B |
E |
two |
NaN |
---|
6 |
B |
F |
NaN |
NaN |
---|
7 |
B |
F |
NaN |
4.0 |
---|
df.groupby(['group1', 'group2'])['B'].transform('count')
0 1
1 1
2 1
3 0
4 1
5 1
6 0
7 0
Name: B, dtype: int64
df['count_B'] = df.groupby(['group1', 'group2'])['B'].transform('count')
df
|
group1 |
group2 |
B |
C |
count_B |
---|
0 |
A |
C |
one |
NaN |
1 |
---|
1 |
A |
C |
NaN |
1.0 |
1 |
---|
2 |
A |
C |
NaN |
NaN |
1 |
---|
3 |
A |
D |
NaN |
NaN |
0 |
---|
4 |
B |
E |
NaN |
NaN |
1 |
---|
5 |
B |
E |
two |
NaN |
1 |
---|
6 |
B |
F |
NaN |
NaN |
0 |
---|
7 |
B |
F |
NaN |
4.0 |
0 |
---|
上面运算的结果分析:
{‘group1’:’A’, ‘group2’:’C’}
的组合共出现3次,即index
为0,1,2。
对应”B”列的值分别是”one”,”NaN”,”NaN”,由于count()
计数时不包括Nan值,因此{‘group1’:’A’, ‘group2’:’C’}
的count
计数值为1。
transform()
方法会将该计数值在dataframe
中所有涉及的rows
都显示出来(我理解应该就进行广播)
将某列数据按数据值分成不同范围段进行分组(groupby)运算
np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100),
'Sex': np.random.choice(['Male', 'Female'], 100),
'number_of_foo': np.random.randint(1, 20, 100)})
df.head()
|
Age |
Sex |
number_of_foo |
---|
0 |
64 |
Female |
14 |
---|
1 |
67 |
Female |
14 |
---|
2 |
20 |
Female |
12 |
---|
3 |
23 |
Male |
17 |
---|
4 |
23 |
Female |
15 |
---|
这里将“Age”列分成三类,有两种方法可以实现:
(a)bins=4
(b)bins=[19, 40, 65, np.inf]
pd.cut(df['Age'], bins=4)
0 (56.75, 69.0]
1 (56.75, 69.0]
2 (19.951, 32.25]
3 (19.951, 32.25]
4 (19.951, 32.25]
...
95 (32.25, 44.5]
96 (32.25, 44.5]
97 (32.25, 44.5]
98 (56.75, 69.0]
99 (56.75, 69.0]
Name: Age, Length: 100, dtype: category
Categories (4, interval[float64]): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69.0]]
pd.cut(df['Age'], bins=[19,40,65,np.inf])
0 (40.0, 65.0]
1 (65.0, inf]
2 (19.0, 40.0]
3 (19.0, 40.0]
4 (19.0, 40.0]
...
95 (19.0, 40.0]
96 (19.0, 40.0]
97 (40.0, 65.0]
98 (65.0, inf]
99 (65.0, inf]
Name: Age, Length: 100, dtype: category
Categories (3, interval[float64]): [(19.0, 40.0] < (40.0, 65.0] < (65.0, inf]]
age_groups = pd.cut(df['Age'], bins=[19,40,65,np.inf])
df.groupby(age_groups).mean()
|
Age |
number_of_foo |
---|
Age |
|
|
---|
(19.0, 40.0] |
29.840000 |
9.880000 |
---|
(40.0, 65.0] |
52.833333 |
9.452381 |
---|
(65.0, inf] |
67.375000 |
9.250000 |
---|
按‘Age’分组范围和性别(sex)进行制作交叉表
pd.crosstab(age_groups, df['Sex'])
Sex |
Female |
Male |
---|
Age |
|
|
---|
(19.0, 40.0] |
22 |
28 |
---|
(40.0, 65.0] |
18 |
24 |
---|
(65.0, inf] |
3 |
5 |
---|
6. 参考文章:
http://stackoverflow.com/documentation/pandas/18