​Pandas​​​的​​groupby()​​功能很强大,用好了可以方便的解决很多问题,在数据处理以及日常工作中经常能施展拳脚。

使用​​Pandas​​实现分组聚合需要分两步走。

第一步是指定分组变量,可以通过数据框的​​groupy()​​完成;

第二步是对不同的数值变量计算各自的统计值。

1. groupby的基础操作

import pandas as pd
import numpy as np


df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 8, 1, 4, 3, 2, 5, 9],
'C': [102, 98, 107, 104, 115, 87, 92, 123]})

按A列分组(groupby),获取其他列的均值

df.groupby('A').mean()

B

C

A

a

2.0

108.000000

b

6.5

95.000000

c

5.0

104.666667

按多列进行分组(groupby)

df.groupby(['A','B']).mean()

C

A

B

a

1

107

2

102

3

115

b

5

92

8

98

c

2

87

4

104

9

123

分组后选择列进行运算

分组后,可以选取单列数据,或者多个列组成的列表(list)进行运算

df = pd.DataFrame([[1, 1, 2], [1, 2, 3], [2, 3, 4]], columns=["A", "B", "C"])
df

A

B

C

0

1

1

2

1

1

2

3

2

2

3

4

g = df.groupby("A")
g['B'].mean() # 仅选择B列
A
1 1.5
2 3.0
Name: B, dtype: float64
g[['B', 'C']].mean() # 选择B、C列

B

C

A

1

1.5

2.5

2

3.0

4.0

可以针对不同的列选用不同的聚合方法

g.agg({'B':'mean', 'C':'sum'})

B

C

A

1

1.5

5

2

3.0

4

2. 聚合方法size()和count()

​size​​​跟​​count​​​的区别: ​​size​​​计数时包含NaN值,而​​count​​不包含NaN值

df = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
"Val":[4,3,3,np.nan,np.nan,4]})
df

Name

City

Val

0

Alice

Seattle

4.0

1

Bob

Seattle

3.0

2

Mallory

Portland

3.0

3

Mallory

Seattle

NaN

4

Bob

Seattle

NaN

5

Mallory

Portland

4.0

count()

df.groupby(["Name", "City"], as_index=False)['Val'].count()

Name

City

Val

0

Alice

Seattle

1

1

Bob

Seattle

1

2

Mallory

Portland

2

3

Mallory

Seattle

0

size()

df.groupby(["Name", "City"])['Val'].size().reset_index(name='Size')

Name

City

Size

0

Alice

Seattle

1

1

Bob

Seattle

2

2

Mallory

Portland

2

3

Mallory

Seattle

1

3. 分组运算方法 agg()

针对某列使用​​agg()​​时进行不同的统计运算

df = pd.DataFrame({'A': list('XYZXYZXYZX'), 
'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2],
'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})
df

A

B

C

0

X

1

12

1

Y

2

14

2

Z

1

11

3

X

3

12

4

Y

1

13

5

Z

2

14

6

X

3

16

7

Y

3

12

8

Z

1

10

9

X

2

19

df.groupby('A')['B'].agg({'mean':np.mean, 'standard deviation': np.std})
e:\Anaconda3.5\envs\ccf3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use named aggregation instead.

>>> grouper.agg(name_1=func_1, name_2=func_2)

"""Entry point for launching an IPython kernel.

mean

standard deviation

A

X

2.250000

0.957427

Y

2.000000

1.000000

Z

1.333333

0.577350

针对不同的列应用多种不同的统计方法

df.groupby('A').agg({'B':[np.mean, 'sum'], 'C':['count',np.std]})

B

C

mean

sum

count

std

A

X

2.250000

9

4

3.403430

Y

2.000000

6

3

1.000000

Z

1.333333

4

3

2.081666

4. 分组运算方法 apply()

df = pd.DataFrame({'A': list('XYZXYZXYZX'), 
'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2],
'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})
df

A

B

C

0

X

1

12

1

Y

2

14

2

Z

1

11

3

X

3

12

4

Y

1

13

5

Z

2

14

6

X

3

16

7

Y

3

12

8

Z

1

10

9

X

2

19

df.groupby('A').apply(np.mean) 
# 跟下面的方法的运行结果是一致的
# df.groupby('A').mean()

B

C

A

X

2.250000

14.750000

Y

2.000000

13.000000

Z

1.333333

11.666667

​apply()​​​方法可以应用​​lambda​​函数,举例如下:

df.groupby('A').apply(lambda x: x['C']-x['B'])
A   
X 0 11
3 9
6 13
9 17
Y 1 12
4 12
7 9
Z 2 10
5 12
8 9
dtype: int64
df.groupby('A').apply(lambda x: (x['C']-x['B']).mean())
A
X 12.500000
Y 11.000000
Z 10.333333
dtype: float64

5. 分组运算方法 transform()

前面进行聚合运算的时候,得到的结果是一个以分组名为​​index​​​ 的结果对象。如果我们想使用原数组的​​index​​​ 的话,就需要进行 ​​merge​​​ 转换。​​transform(func, args, *kwargs)​​​ 方法简化了这个过程,它会把 ​​func​​​ 参数应用到所有分组,然后把结果放置到原数组的 ​​index​​ 上(如果结果是一个标量,就进行广播)

df = pd.DataFrame({'group1' :  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'group2' : ['C', 'C', 'C', 'D', 'E', 'E', 'F', 'F'],
'B' : ['one', np.NaN, np.NaN, np.NaN, np.NaN, 'two', np.NaN, np.NaN],
'C' : [np.NaN, 1, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, 4]})
df

group1

group2

B

C

0

A

C

one

NaN

1

A

C

NaN

1.0

2

A

C

NaN

NaN

3

A

D

NaN

NaN

4

B

E

NaN

NaN

5

B

E

two

NaN

6

B

F

NaN

NaN

7

B

F

NaN

4.0

df.groupby(['group1', 'group2'])['B'].transform('count')
0    1
1 1
2 1
3 0
4 1
5 1
6 0
7 0
Name: B, dtype: int64
df['count_B'] = df.groupby(['group1', 'group2'])['B'].transform('count')
df

group1

group2

B

C

count_B

0

A

C

one

NaN

1

1

A

C

NaN

1.0

1

2

A

C

NaN

NaN

1

3

A

D

NaN

NaN

0

4

B

E

NaN

NaN

1

5

B

E

two

NaN

1

6

B

F

NaN

NaN

0

7

B

F

NaN

4.0

0

上面运算的结果分析:

​{‘group1’:’A’, ‘group2’:’C’}​​的组合共出现3次,即​​index​​为0,1,2。

对应”B”列的值分别是”one”,”NaN”,”NaN”,由于​​count()​​计数时不包括Nan值,因此​​{‘group1’:’A’, ‘group2’:’C’}​​的​​count​​计数值为1。

​transform()​​方法会将该计数值在​​dataframe​​中所有涉及的​​rows​​都显示出来(我理解应该就进行广播)

将某列数据按数据值分成不同范围段进行分组(groupby)运算

np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100),
'Sex': np.random.choice(['Male', 'Female'], 100),
'number_of_foo': np.random.randint(1, 20, 100)})
df.head()

Age

Sex

number_of_foo

0

64

Female

14

1

67

Female

14

2

20

Female

12

3

23

Male

17

4

23

Female

15

这里将“Age”列分成三类,有两种方法可以实现:

(a)bins=4

(b)bins=[19, 40, 65, np.inf]

pd.cut(df['Age'], bins=4)
0       (56.75, 69.0]
1 (56.75, 69.0]
2 (19.951, 32.25]
3 (19.951, 32.25]
4 (19.951, 32.25]
...
95 (32.25, 44.5]
96 (32.25, 44.5]
97 (32.25, 44.5]
98 (56.75, 69.0]
99 (56.75, 69.0]
Name: Age, Length: 100, dtype: category
Categories (4, interval[float64]): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69.0]]
pd.cut(df['Age'], bins=[19,40,65,np.inf])
0     (40.0, 65.0]
1 (65.0, inf]
2 (19.0, 40.0]
3 (19.0, 40.0]
4 (19.0, 40.0]
...
95 (19.0, 40.0]
96 (19.0, 40.0]
97 (40.0, 65.0]
98 (65.0, inf]
99 (65.0, inf]
Name: Age, Length: 100, dtype: category
Categories (3, interval[float64]): [(19.0, 40.0] < (40.0, 65.0] < (65.0, inf]]
age_groups = pd.cut(df['Age'], bins=[19,40,65,np.inf])
df.groupby(age_groups).mean()

Age

number_of_foo

Age

(19.0, 40.0]

29.840000

9.880000

(40.0, 65.0]

52.833333

9.452381

(65.0, inf]

67.375000

9.250000

按‘Age’分组范围和性别(sex)进行制作交叉表

pd.crosstab(age_groups, df['Sex'])

Sex

Female

Male

Age

(19.0, 40.0]

22

28

(40.0, 65.0]

18

24

(65.0, inf]

3

5

6. 参考文章:

​http://stackoverflow.com/documentation/pandas/18​