注:本教程为系列教程此章节接前面第一弹

本文向导

16 分组聚合、过滤、转换

16.1 准备数据

# 设置最多显示8列数据
pd.options.display.max_columns = 8

collage_data = pd.read_csv("pandasLearnData/college.csv")
collage_data.head(5)
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
3 University of Alabama in Huntsville Huntsville AL 0.0 ... 0.4596 0.2640 45500 24097
4 Alabama State University Montgomery AL 1.0 ... 0.7554 0.1270 26600 33118.5

5 rows × 27 columns

16.2 定义聚合

gb_STABBR = collage_data.groupby("STABBR")

16.3 聚合的属性原理

16.3.1 聚合类别

type(gb_STABBR)
pandas.core.groupby.generic.DataFrameGroupBy

16.3.2 聚合类型的所用方法和属性

for attr in dir(gb_STABBR):
    print("" if attr.startswith("_") else attr+"\t",end="")
CITY	CURROPER	DISTANCEONLY	GRAD_DEBT_MDN_SUPP	HBCU	INSTNM	MD_EARN_WNE_P10	MENONLY	PCTFLOAN	PCTPELL	PPTUG_EF	RELAFFIL	SATMTMID	SATVRMID	STABBR	UG25ABV	UGDS	UGDS_2MOR	UGDS_AIAN	UGDS_ASIAN	UGDS_BLACK	UGDS_HISP	UGDS_NHPI	UGDS_NRA	UGDS_UNKN	UGDS_WHITE	WOMENONLY	agg	aggregate	all	any	apply	backfill	bfill	boxplot	corr	corrwith	count	cov	cumcount	cummax	cummin	cumprod	cumsum	describe	diff	dtypes	expanding	ffill	fillna	filter	first	get_group	groups	head	hist	idxmax	idxmin	indices	last	mad	max	mean	median	min	ndim	ngroup	ngroups	nth	nunique	ohlc	pad	pct_change	pipe	plot	prod	quantile	rank	resample	rolling	sem	shift	size	skew	std	sum	tail	take	transform	tshift	var	

16.3.3 查看分组数量

gb_STABBR.ngroups
59

16.3.4 返回分组的键值

# groups是返回一个字典 {"分组键":索引列表, ...}
gb_STABBR.groups.keys()
dict_keys(['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'PW', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'])

16.3.5 根据键获取某一个分组

gb_STABBR.get_group("AK")
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
60 University of Alaska Anchorage Anchorage AK 0.0 ... 0.2647 0.4386 42500 19449.5
61 Alaska Bible College Palmer AK 0.0 ... 0.2857 0.4286 NaN PrivacySuppressed
62 University of Alaska Fairbanks Fairbanks AK 0.0 ... 0.2550 0.4519 36200 19355
63 University of Alaska Southeast Juneau AK 0.0 ... 0.1996 0.5550 37400 16875
64 Alaska Pacific University Anchorage AK 0.0 ... 0.5297 0.4910 47000 23250
65 AVTEC-Alaska's Institute of Technology Seward AK 0.0 ... 0.0664 0.7127 33500 PrivacySuppressed
66 Charter College-Anchorage Anchorage AK 0.0 ... 0.7503 0.5472 39200 13875
67 Alaska Career College Anchorage AK 0.0 ... 0.7860 0.5612 28700 8994
5171 Ilisagvik College Barrow AK 0.0 ... 0.0000 0.6498 24900 PrivacySuppressed
5417 Alaska Christian College Soldotna AK 0.0 ... 0.6792 0.2264 NaN PrivacySuppressed

10 rows × 27 columns

16.3.6 DataFrameGroupBy对象是一个可迭代对象

from collections import Iterable
isinstance(gb_STABBR,Iterable)
True
# 显示前3个分组
i  = 0
for name,group in gb_STABBR:
    i += 1
    print("组名:",name,"\t\t类型:",type(group))
    display(group.head(2))
    if i >= 3:
        break
组名: AK 		类型: <class 'pandas.core.frame.DataFrame'>
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
60 University of Alaska Anchorage Anchorage AK 0.0 ... 0.2647 0.4386 42500 19449.5
61 Alaska Bible College Palmer AK 0.0 ... 0.2857 0.4286 NaN PrivacySuppressed

2 rows × 27 columns

组名: AL 		类型: <class 'pandas.core.frame.DataFrame'>
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5

2 rows × 27 columns

组名: AR 		类型: <class 'pandas.core.frame.DataFrame'>
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
128 University of Arkansas at Little Rock Little Rock AR 0.0 ... 0.4775 0.4062 33900 21736
129 University of Arkansas for Medical Sciences Little Rock AR 0.0 ... 0.6144 0.5133 61400 12500

2 rows × 27 columns

16.3.7 head显示每个分组的头几行

# 前面一个head用于空值每个分组内显示前几行,后面表示总共显示前几行
gb_STABBR.head(2).head(5)
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
43 Prince Institute-Southeast Elmhurst IL 0.0 ... 0.9375 0.6569 PrivacySuppressed 20992
60 University of Alaska Anchorage Anchorage AK 0.0 ... 0.2647 0.4386 42500 19449.5
61 Alaska Bible College Palmer AK 0.0 ... 0.2857 0.4286 NaN PrivacySuppressed

5 rows × 27 columns

16.3.8 nth指定显示分组内的相应行

# 显示每个分组的第一行,最后一行
gb_STABBR.nth([1,-1]).head(5)
INSTNM CITY HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
STABBR
AK Alaska Bible College Palmer 0.0 0.0 ... 0.2857 0.4286 NaN PrivacySuppressed
AK Alaska Christian College Soldotna 0.0 0.0 ... 0.6792 0.2264 NaN PrivacySuppressed
AL University of Alabama at Birmingham Birmingham 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
AL Strayer University-Huntsville Campus Huntsville NaN NaN ... NaN NaN 49200 36173.5
AR Career Academy of Hair Design-Fayetteville Fayetteville NaN NaN ... NaN NaN NaN 6365

5 rows × 26 columns

16.4 对分组对象使用聚合函数

16.4.1 agg函数传入聚合函数名称

# 统计每个每个州的本科生人数
gb_STABBR["UGDS"].agg("sum").head(5)
STABBR
AK     24932.0
AL    248298.0
AR    134820.0
AS      1276.0
AZ    520439.0
Name: UGDS, dtype: float64

16.4.2 直接使用统计函数

gb_STABBR["UGDS"].sum().head(5)
STABBR
AK     24932.0
AL    248298.0
AR    134820.0
AS      1276.0
AZ    520439.0
Name: UGDS, dtype: float64

16.4.3 agg传入多种聚合函数进行统计

gb_STABBR["UGDS"].agg(["mean",np.sum,"std"]).head(5)
mean sum std
STABBR
AK 2493.200000 24932.0 4051.726650
AL 2789.865169 248298.0 4657.877043
AR 1644.146341 134820.0 3142.774213
AS 1276.000000 1276.0 NaN
AZ 4130.468254 520439.0 14893.640651

16.5 对多个字段进行分组

16.5.1 分组

#对州和城市进行分组
collage_data.groupby(["STABBR","CITY"]).head(2).head(5)
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
3 University of Alabama in Huntsville Huntsville AL 0.0 ... 0.4596 0.2640 45500 24097
4 Alabama State University Montgomery AL 1.0 ... 0.7554 0.1270 26600 33118.5

5 rows × 27 columns

16.5.2 对多个字段分组并使用多种聚合方法

grouped = collage_data.groupby(["STABBR","CITY"]).agg({"INSTNM":"size","UGDS":["mean","std","sum"]})
grouped.head(5)
INSTNM UGDS
size mean std sum
STABBR CITY
AK Anchorage 4 4218.75 5922.370605 16875.0
Barrow 1 109.00 NaN 109.0
Fairbanks 1 5536.00 NaN 5536.0
Juneau 1 1428.00 NaN 1428.0
Palmer 1 27.00 NaN 27.0

16.6 消除多级索引

16.6.1 查看索引结构

# 可以看到列索引具有两级
grouped.columns
MultiIndex([('INSTNM', 'size'),
            (  'UGDS', 'mean'),
            (  'UGDS',  'std'),
            (  'UGDS',  'sum')],
           )

16.6.2 获取一级索引

columns_level1 = grouped.columns.get_level_values(0)
columns_level1
Index(['INSTNM', 'UGDS', 'UGDS', 'UGDS'], dtype='object')

16.6.3 获取二级索引

columns_level2 = grouped.columns.get_level_values(1)
columns_level2
Index(['size', 'mean', 'std', 'sum'], dtype='object')

16.6.4 合并索引

grouped.columns = columns_level1 + "_" + columns_level2
grouped.head(5)
INSTNM_size UGDS_mean UGDS_std UGDS_sum
STABBR CITY
AK Anchorage 4 4218.75 5922.370605 16875.0
Barrow 1 109.00 NaN 109.0
Fairbanks 1 5536.00 NaN 5536.0
Juneau 1 1428.00 NaN 1428.0
Palmer 1 27.00 NaN 27.0

16.6.5 行索引可直接利用reset_index清除

grouped.reset_index().head(5)
STABBR CITY INSTNM_size UGDS_mean UGDS_std UGDS_sum
0 AK Anchorage 4 4218.75 5922.370605 16875.0
1 AK Barrow 1 109.00 NaN 109.0
2 AK Fairbanks 1 5536.00 NaN 5536.0
3 AK Juneau 1 1428.00 NaN 1428.0
4 AK Palmer 1 27.00 NaN 27.0

16.6.6 在使用groupby方法时,传入参数不增加索引

# as_index:表示是否改变索引,sort:表示是否根据分组字段进行排序
collage_data.groupby(["STABBR","CITY"],as_index=False,sort=False)\
.agg({"INSTNM":"size","UGDS":["mean","std","sum"]})\
.head(5)
STABBR CITY INSTNM UGDS
size mean std sum
0 AL Normal 1 4206.000000 NaN 4206.0
1 AL Birmingham 14 2236.428571 3314.611283 31310.0
2 AL Montgomery 11 1562.300000 1717.282482 15623.0
3 AL Huntsville 7 1511.500000 2042.258627 9069.0
4 AL Tuscaloosa 3 11887.333333 15666.545258 35662.0

16.7 自定义聚合函数

16.7.1 定义聚合函数

def my_max(s):
    max_value = 0
    for i in  s:
        max_value = max_value if max_value > i and i != np.NAN else i
    return s.max()

16.7.2 使用自定义聚合函数

collage_data.groupby("STABBR")[["UGDS"]].agg(my_max).head(5)
UGDS
STABBR
AK 12865.0
AL 29851.0
AR 21405.0
AS 1276.0
AZ 151558.0

16.7.3 和自带的聚合函数混合使用

collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
mean sum my_max
STABBR
AK 2493.200000 24932.0 12865.0
AL 2789.865169 248298.0 29851.0
AR 1644.146341 134820.0 21405.0
AS 1276.000000 1276.0 1276.0
AZ 4130.468254 520439.0 151558.0

16.7.4 修改列名

# 默认就是显示该聚合函数的名称(__name__属性)
# 方法1: 修改__name__属性即可
my_max.__name__ = "max"
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
mean sum max
STABBR
AK 2493.200000 24932.0 12865.0
AL 2789.865169 248298.0 29851.0
AR 1644.146341 134820.0 21405.0
AS 1276.000000 1276.0 1276.0
AZ 4130.468254 520439.0 151558.0
# 方法2: 使用DataFrame的rename方法修改列名称
# inplace 参数表示是否在操作的DataFrame中进行修改
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).rename({"my_max":"max"},inplace=False).head(5)
mean sum max
STABBR
AK 2493.200000 24932.0 12865.0
AL 2789.865169 248298.0 29851.0
AR 1644.146341 134820.0 21405.0
AS 1276.000000 1276.0 1276.0
AZ 4130.468254 520439.0 151558.0

16.8 用 args 和 *kwargs 自定义聚合函数

16.8.1 定义聚合函数

def  my_between_cnt(s,low,high):
    return s.between(low,high).sum()

16.8.2 使用自定义聚合函数

collage_data.groupby("STABBR")["UGDS"].agg(my_between_cnt,10000,20000).head(5)
STABBR
AK    1.0
AL    4.0
AR    0.0
AS    0.0
AZ    4.0
Name: UGDS, dtype: float64

16.8.3 使用闭包简化代码

def init_agg_func(func,col_name,*args,**kwargs):
    def wrapper(s):
        return func(s,*args,**kwargs)
    wrapper.__name__ = col_name
    return wrapper

my_max = init_agg_func(np.max,"max")
between_10_20k = init_agg_func(my_between_cnt,"10k~20k",10000,20000)
between_20_30k = init_agg_func(my_between_cnt,"20k~30k",20000,30000)

# 统计每个州本科生人数的最大值,和在10k-20k,20k-30k之间的人数
collage_data.groupby("STABBR")["UGDS"].agg([my_max,between_10_20k,between_20_30k]).head(5)
max 10k~20k 20k~30k
STABBR
AK 12865.0 1.0 0.0
AL 29851.0 4.0 2.0
AR 21405.0 0.0 1.0
AS 1276.0 0.0 0.0
AZ 151558.0 4.0 2.0

16.9 使用filter对分组进行过滤

# 对DataFrameGroupBy对象使用过滤时,会根据回调函数返回的布尔值决定是否保留该分组
# 显示每个州本科生人数和大于100万的州
collage_data.groupby("STABBR").filter(lambda x:x["UGDS"].sum() > 1000000)
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
192 Academy of Art University San Francisco CA 0.0 ... 0.5524 0.4043 36000 35093
193 ITT Technical Institute-Rancho Cordova Rancho Cordova CA 0.0 ... 0.7667 0.7235 38800 25827.5
194 Academy of Chinese Culture and Health Sciences Oakland CA 0.0 ... NaN NaN NaN PrivacySuppressed
195 The Academy of Radio and TV Broadcasting Huntington Beach CA 0.0 ... 1.0000 0.4545 28400 9500
196 Avalon School of Cosmetology-Alameda Alameda CA 0.0 ... 0.6768 0.3387 21600 9860
... ... ... ... ... ... ... ... ... ...
7528 WestMed College - Merced Merced CA NaN ... NaN NaN NaN 15623.5
7529 Vantage College El Paso TX NaN ... NaN NaN NaN 9500
7530 SAE Institute of Technology San Francisco Emeryville CA NaN ... NaN NaN NaN 9500
7533 Bay Area Medical Academy - San Jose Satellite ... San Jose CA NaN ... NaN NaN NaN PrivacySuppressed
7534 Excel Learning Center-San Antonio South San Antonio TX NaN ... NaN NaN NaN 12125

1245 rows × 27 columns


INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
3 University of Alabama in Huntsville Huntsville AL 0.0 ... 0.4596 0.2640 45500 24097
4 Alabama State University Montgomery AL 1.0 ... 0.7554 0.1270 26600 33118.5
... ... ... ... ... ... ... ... ... ...
7530 SAE Institute of Technology San Francisco Emeryville CA NaN ... NaN NaN NaN 9500
7531 Rasmussen College - Overland Park Overland Park KS NaN ... NaN NaN NaN 21163
7532 National Personal Training Institute of Cleveland Highland Heights OH NaN ... NaN NaN NaN 6333
7533 Bay Area Medical Academy - San Jose Satellite ... San Jose CA NaN ... NaN NaN NaN PrivacySuppressed
7534 Excel Learning Center-San Antonio South San Antonio TX NaN ... NaN NaN NaN 12125

7535 rows × 27 columns

16.10 分组对象的 apply方法的使用

# 注意这里apply和DataFrame中的apply方法不一样,这里传入回调函数的是一个分组的DataFrame
# 你需要返回的是一个Series
# 求每个分组各个字段的平均值
collage_data.groupby("STABBR").apply(lambda df:df.mean()).head(5)
HBCU MENONLY WOMENONLY RELAFFIL ... CURROPER PCTPELL PCTFLOAN UG25ABV
STABBR
AK 0.000000 0.0 0.000000 0.300000 ... 1.000000 0.394530 0.381660 0.506240
AL 0.166667 0.0 0.011111 0.250000 ... 0.937500 0.603621 0.509734 0.387039
AR 0.048780 0.0 0.000000 0.209302 ... 0.965116 0.581470 0.505556 0.356059
AS 0.000000 0.0 0.000000 0.000000 ... 1.000000 0.724500 0.000000 0.177400
AZ 0.000000 0.0 0.000000 0.067669 ... 0.879699 0.549792 0.543702 0.480859

5 rows × 22 columns

16.11 用连续变量分组

16.11.1 cut概念

# 被划分成了六个片元(就是六个区间,将这些区间映射到传入的Series中)
my_cut = pd.cut(collage_data["UGDS"],bins=[-np.Inf,10000,20000,30000,40000,50000,np.Inf])
my_cut
0          (-inf, 10000.0]
1       (10000.0, 20000.0]
2          (-inf, 10000.0]
3          (-inf, 10000.0]
4          (-inf, 10000.0]
               ...        
7530                   NaN
7531                   NaN
7532                   NaN
7533                   NaN
7534                   NaN
Name: UGDS, Length: 7535, dtype: category
Categories (6, interval[float64]): [(-inf, 10000.0] < (10000.0, 20000.0] < (20000.0, 30000.0] < (30000.0, 40000.0] < (40000.0, 50000.0] < (50000.0, inf]]

16.11.2 使用cut进行分组

collage_data.groupby(my_cut,as_index=True)[["UGDS","INSTNM"]].max()
UGDS INSTNM
UGDS
(-inf, 10000.0] 9999.0 eClips School of Cosmetology and Barbering
(10000.0, 20000.0] 19979.0 Youngstown State University
(20000.0, 30000.0] 29977.0 West Virginia University
(30000.0, 40000.0] 39958.0 Valencia College
(40000.0, 50000.0] 49340.0 Western Governors University
(50000.0, inf] 151558.0 University of Phoenix-Arizona