注:本教程为系列教程此章节接前面第一弹
16 分组聚合、过滤、转换
16.1 准备数据
# 设置最多显示8列数据
pd.options.display.max_columns = 8
collage_data = pd.read_csv("pandasLearnData/college.csv")
collage_data.head(5)
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
0 |
Alabama A & M University |
Normal |
AL |
1.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
1 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
2 |
Amridge University |
Montgomery |
AL |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
3 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
4 |
Alabama State University |
Montgomery |
AL |
1.0 |
... |
0.7554 |
0.1270 |
26600 |
33118.5 |
5 rows × 27 columns
16.2 定义聚合
gb_STABBR = collage_data.groupby("STABBR")
16.3 聚合的属性原理
16.3.1 聚合类别
type(gb_STABBR)
pandas.core.groupby.generic.DataFrameGroupBy
16.3.2 聚合类型的所用方法和属性
for attr in dir(gb_STABBR):
print("" if attr.startswith("_") else attr+"\t",end="")
CITY CURROPER DISTANCEONLY GRAD_DEBT_MDN_SUPP HBCU INSTNM MD_EARN_WNE_P10 MENONLY PCTFLOAN PCTPELL PPTUG_EF RELAFFIL SATMTMID SATVRMID STABBR UG25ABV UGDS UGDS_2MOR UGDS_AIAN UGDS_ASIAN UGDS_BLACK UGDS_HISP UGDS_NHPI UGDS_NRA UGDS_UNKN UGDS_WHITE WOMENONLY agg aggregate all any apply backfill bfill boxplot corr corrwith count cov cumcount cummax cummin cumprod cumsum describe diff dtypes expanding ffill fillna filter first get_group groups head hist idxmax idxmin indices last mad max mean median min ndim ngroup ngroups nth nunique ohlc pad pct_change pipe plot prod quantile rank resample rolling sem shift size skew std sum tail take transform tshift var
16.3.3 查看分组数量
gb_STABBR.ngroups
59
16.3.4 返回分组的键值
# groups是返回一个字典 {"分组键":索引列表, ...}
gb_STABBR.groups.keys()
dict_keys(['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'PW', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'])
16.3.5 根据键获取某一个分组
gb_STABBR.get_group("AK")
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
60 |
University of Alaska Anchorage |
Anchorage |
AK |
0.0 |
... |
0.2647 |
0.4386 |
42500 |
19449.5 |
61 |
Alaska Bible College |
Palmer |
AK |
0.0 |
... |
0.2857 |
0.4286 |
NaN |
PrivacySuppressed |
62 |
University of Alaska Fairbanks |
Fairbanks |
AK |
0.0 |
... |
0.2550 |
0.4519 |
36200 |
19355 |
63 |
University of Alaska Southeast |
Juneau |
AK |
0.0 |
... |
0.1996 |
0.5550 |
37400 |
16875 |
64 |
Alaska Pacific University |
Anchorage |
AK |
0.0 |
... |
0.5297 |
0.4910 |
47000 |
23250 |
65 |
AVTEC-Alaska's Institute of Technology |
Seward |
AK |
0.0 |
... |
0.0664 |
0.7127 |
33500 |
PrivacySuppressed |
66 |
Charter College-Anchorage |
Anchorage |
AK |
0.0 |
... |
0.7503 |
0.5472 |
39200 |
13875 |
67 |
Alaska Career College |
Anchorage |
AK |
0.0 |
... |
0.7860 |
0.5612 |
28700 |
8994 |
5171 |
Ilisagvik College |
Barrow |
AK |
0.0 |
... |
0.0000 |
0.6498 |
24900 |
PrivacySuppressed |
5417 |
Alaska Christian College |
Soldotna |
AK |
0.0 |
... |
0.6792 |
0.2264 |
NaN |
PrivacySuppressed |
10 rows × 27 columns
16.3.6 DataFrameGroupBy对象是一个可迭代对象
from collections import Iterable
isinstance(gb_STABBR,Iterable)
True
# 显示前3个分组
i = 0
for name,group in gb_STABBR:
i += 1
print("组名:",name,"\t\t类型:",type(group))
display(group.head(2))
if i >= 3:
break
组名: AK 类型: <class 'pandas.core.frame.DataFrame'>
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
60 |
University of Alaska Anchorage |
Anchorage |
AK |
0.0 |
... |
0.2647 |
0.4386 |
42500 |
19449.5 |
61 |
Alaska Bible College |
Palmer |
AK |
0.0 |
... |
0.2857 |
0.4286 |
NaN |
PrivacySuppressed |
2 rows × 27 columns
组名: AL 类型: <class 'pandas.core.frame.DataFrame'>
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
0 |
Alabama A & M University |
Normal |
AL |
1.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
1 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
2 rows × 27 columns
组名: AR 类型: <class 'pandas.core.frame.DataFrame'>
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
128 |
University of Arkansas at Little Rock |
Little Rock |
AR |
0.0 |
... |
0.4775 |
0.4062 |
33900 |
21736 |
129 |
University of Arkansas for Medical Sciences |
Little Rock |
AR |
0.0 |
... |
0.6144 |
0.5133 |
61400 |
12500 |
2 rows × 27 columns
16.3.7 head显示每个分组的头几行
# 前面一个head用于空值每个分组内显示前几行,后面表示总共显示前几行
gb_STABBR.head(2).head(5)
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
0 |
Alabama A & M University |
Normal |
AL |
1.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
1 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
43 |
Prince Institute-Southeast |
Elmhurst |
IL |
0.0 |
... |
0.9375 |
0.6569 |
PrivacySuppressed |
20992 |
60 |
University of Alaska Anchorage |
Anchorage |
AK |
0.0 |
... |
0.2647 |
0.4386 |
42500 |
19449.5 |
61 |
Alaska Bible College |
Palmer |
AK |
0.0 |
... |
0.2857 |
0.4286 |
NaN |
PrivacySuppressed |
5 rows × 27 columns
16.3.8 nth指定显示分组内的相应行
# 显示每个分组的第一行,最后一行
gb_STABBR.nth([1,-1]).head(5)
|
INSTNM |
CITY |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
STABBR |
|
|
|
|
|
|
|
|
|
---|
AK |
Alaska Bible College |
Palmer |
0.0 |
0.0 |
... |
0.2857 |
0.4286 |
NaN |
PrivacySuppressed |
AK |
Alaska Christian College |
Soldotna |
0.0 |
0.0 |
... |
0.6792 |
0.2264 |
NaN |
PrivacySuppressed |
AL |
University of Alabama at Birmingham |
Birmingham |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
AL |
Strayer University-Huntsville Campus |
Huntsville |
NaN |
NaN |
... |
NaN |
NaN |
49200 |
36173.5 |
AR |
Career Academy of Hair Design-Fayetteville |
Fayetteville |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
6365 |
5 rows × 26 columns
16.4 对分组对象使用聚合函数
16.4.1 agg函数传入聚合函数名称
# 统计每个每个州的本科生人数
gb_STABBR["UGDS"].agg("sum").head(5)
STABBR
AK 24932.0
AL 248298.0
AR 134820.0
AS 1276.0
AZ 520439.0
Name: UGDS, dtype: float64
16.4.2 直接使用统计函数
gb_STABBR["UGDS"].sum().head(5)
STABBR
AK 24932.0
AL 248298.0
AR 134820.0
AS 1276.0
AZ 520439.0
Name: UGDS, dtype: float64
16.4.3 agg传入多种聚合函数进行统计
gb_STABBR["UGDS"].agg(["mean",np.sum,"std"]).head(5)
|
mean |
sum |
std |
---|
STABBR |
|
|
|
---|
AK |
2493.200000 |
24932.0 |
4051.726650 |
AL |
2789.865169 |
248298.0 |
4657.877043 |
AR |
1644.146341 |
134820.0 |
3142.774213 |
AS |
1276.000000 |
1276.0 |
NaN |
AZ |
4130.468254 |
520439.0 |
14893.640651 |
16.5 对多个字段进行分组
16.5.1 分组
#对州和城市进行分组
collage_data.groupby(["STABBR","CITY"]).head(2).head(5)
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
0 |
Alabama A & M University |
Normal |
AL |
1.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
1 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
2 |
Amridge University |
Montgomery |
AL |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
3 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
4 |
Alabama State University |
Montgomery |
AL |
1.0 |
... |
0.7554 |
0.1270 |
26600 |
33118.5 |
5 rows × 27 columns
16.5.2 对多个字段分组并使用多种聚合方法
grouped = collage_data.groupby(["STABBR","CITY"]).agg({"INSTNM":"size","UGDS":["mean","std","sum"]})
grouped.head(5)
|
|
INSTNM |
UGDS |
---|
|
|
size |
mean |
std |
sum |
---|
STABBR |
CITY |
|
|
|
|
---|
AK |
Anchorage |
4 |
4218.75 |
5922.370605 |
16875.0 |
Barrow |
1 |
109.00 |
NaN |
109.0 |
Fairbanks |
1 |
5536.00 |
NaN |
5536.0 |
Juneau |
1 |
1428.00 |
NaN |
1428.0 |
Palmer |
1 |
27.00 |
NaN |
27.0 |
16.6 消除多级索引
16.6.1 查看索引结构
# 可以看到列索引具有两级
grouped.columns
MultiIndex([('INSTNM', 'size'),
( 'UGDS', 'mean'),
( 'UGDS', 'std'),
( 'UGDS', 'sum')],
)
16.6.2 获取一级索引
columns_level1 = grouped.columns.get_level_values(0)
columns_level1
Index(['INSTNM', 'UGDS', 'UGDS', 'UGDS'], dtype='object')
16.6.3 获取二级索引
columns_level2 = grouped.columns.get_level_values(1)
columns_level2
Index(['size', 'mean', 'std', 'sum'], dtype='object')
16.6.4 合并索引
grouped.columns = columns_level1 + "_" + columns_level2
grouped.head(5)
|
|
INSTNM_size |
UGDS_mean |
UGDS_std |
UGDS_sum |
---|
STABBR |
CITY |
|
|
|
|
---|
AK |
Anchorage |
4 |
4218.75 |
5922.370605 |
16875.0 |
Barrow |
1 |
109.00 |
NaN |
109.0 |
Fairbanks |
1 |
5536.00 |
NaN |
5536.0 |
Juneau |
1 |
1428.00 |
NaN |
1428.0 |
Palmer |
1 |
27.00 |
NaN |
27.0 |
16.6.5 行索引可直接利用reset_index清除
grouped.reset_index().head(5)
|
STABBR |
CITY |
INSTNM_size |
UGDS_mean |
UGDS_std |
UGDS_sum |
---|
0 |
AK |
Anchorage |
4 |
4218.75 |
5922.370605 |
16875.0 |
1 |
AK |
Barrow |
1 |
109.00 |
NaN |
109.0 |
2 |
AK |
Fairbanks |
1 |
5536.00 |
NaN |
5536.0 |
3 |
AK |
Juneau |
1 |
1428.00 |
NaN |
1428.0 |
4 |
AK |
Palmer |
1 |
27.00 |
NaN |
27.0 |
16.6.6 在使用groupby方法时,传入参数不增加索引
# as_index:表示是否改变索引,sort:表示是否根据分组字段进行排序
collage_data.groupby(["STABBR","CITY"],as_index=False,sort=False)\
.agg({"INSTNM":"size","UGDS":["mean","std","sum"]})\
.head(5)
|
STABBR |
CITY |
INSTNM |
UGDS |
---|
|
|
|
size |
mean |
std |
sum |
---|
0 |
AL |
Normal |
1 |
4206.000000 |
NaN |
4206.0 |
1 |
AL |
Birmingham |
14 |
2236.428571 |
3314.611283 |
31310.0 |
2 |
AL |
Montgomery |
11 |
1562.300000 |
1717.282482 |
15623.0 |
3 |
AL |
Huntsville |
7 |
1511.500000 |
2042.258627 |
9069.0 |
4 |
AL |
Tuscaloosa |
3 |
11887.333333 |
15666.545258 |
35662.0 |
16.7 自定义聚合函数
16.7.1 定义聚合函数
def my_max(s):
max_value = 0
for i in s:
max_value = max_value if max_value > i and i != np.NAN else i
return s.max()
16.7.2 使用自定义聚合函数
collage_data.groupby("STABBR")[["UGDS"]].agg(my_max).head(5)
|
UGDS |
---|
STABBR |
|
---|
AK |
12865.0 |
AL |
29851.0 |
AR |
21405.0 |
AS |
1276.0 |
AZ |
151558.0 |
16.7.3 和自带的聚合函数混合使用
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
|
mean |
sum |
my_max |
---|
STABBR |
|
|
|
---|
AK |
2493.200000 |
24932.0 |
12865.0 |
AL |
2789.865169 |
248298.0 |
29851.0 |
AR |
1644.146341 |
134820.0 |
21405.0 |
AS |
1276.000000 |
1276.0 |
1276.0 |
AZ |
4130.468254 |
520439.0 |
151558.0 |
16.7.4 修改列名
# 默认就是显示该聚合函数的名称(__name__属性)
# 方法1: 修改__name__属性即可
my_max.__name__ = "max"
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)
|
mean |
sum |
max |
---|
STABBR |
|
|
|
---|
AK |
2493.200000 |
24932.0 |
12865.0 |
AL |
2789.865169 |
248298.0 |
29851.0 |
AR |
1644.146341 |
134820.0 |
21405.0 |
AS |
1276.000000 |
1276.0 |
1276.0 |
AZ |
4130.468254 |
520439.0 |
151558.0 |
# 方法2: 使用DataFrame的rename方法修改列名称
# inplace 参数表示是否在操作的DataFrame中进行修改
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).rename({"my_max":"max"},inplace=False).head(5)
|
mean |
sum |
max |
---|
STABBR |
|
|
|
---|
AK |
2493.200000 |
24932.0 |
12865.0 |
AL |
2789.865169 |
248298.0 |
29851.0 |
AR |
1644.146341 |
134820.0 |
21405.0 |
AS |
1276.000000 |
1276.0 |
1276.0 |
AZ |
4130.468254 |
520439.0 |
151558.0 |
16.8 用 args 和 *kwargs 自定义聚合函数
16.8.1 定义聚合函数
def my_between_cnt(s,low,high):
return s.between(low,high).sum()
16.8.2 使用自定义聚合函数
collage_data.groupby("STABBR")["UGDS"].agg(my_between_cnt,10000,20000).head(5)
STABBR
AK 1.0
AL 4.0
AR 0.0
AS 0.0
AZ 4.0
Name: UGDS, dtype: float64
16.8.3 使用闭包简化代码
def init_agg_func(func,col_name,*args,**kwargs):
def wrapper(s):
return func(s,*args,**kwargs)
wrapper.__name__ = col_name
return wrapper
my_max = init_agg_func(np.max,"max")
between_10_20k = init_agg_func(my_between_cnt,"10k~20k",10000,20000)
between_20_30k = init_agg_func(my_between_cnt,"20k~30k",20000,30000)
# 统计每个州本科生人数的最大值,和在10k-20k,20k-30k之间的人数
collage_data.groupby("STABBR")["UGDS"].agg([my_max,between_10_20k,between_20_30k]).head(5)
|
max |
10k~20k |
20k~30k |
---|
STABBR |
|
|
|
---|
AK |
12865.0 |
1.0 |
0.0 |
AL |
29851.0 |
4.0 |
2.0 |
AR |
21405.0 |
0.0 |
1.0 |
AS |
1276.0 |
0.0 |
0.0 |
AZ |
151558.0 |
4.0 |
2.0 |
16.9 使用filter对分组进行过滤
# 对DataFrameGroupBy对象使用过滤时,会根据回调函数返回的布尔值决定是否保留该分组
# 显示每个州本科生人数和大于100万的州
collage_data.groupby("STABBR").filter(lambda x:x["UGDS"].sum() > 1000000)
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
192 |
Academy of Art University |
San Francisco |
CA |
0.0 |
... |
0.5524 |
0.4043 |
36000 |
35093 |
193 |
ITT Technical Institute-Rancho Cordova |
Rancho Cordova |
CA |
0.0 |
... |
0.7667 |
0.7235 |
38800 |
25827.5 |
194 |
Academy of Chinese Culture and Health Sciences |
Oakland |
CA |
0.0 |
... |
NaN |
NaN |
NaN |
PrivacySuppressed |
195 |
The Academy of Radio and TV Broadcasting |
Huntington Beach |
CA |
0.0 |
... |
1.0000 |
0.4545 |
28400 |
9500 |
196 |
Avalon School of Cosmetology-Alameda |
Alameda |
CA |
0.0 |
... |
0.6768 |
0.3387 |
21600 |
9860 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
7528 |
WestMed College - Merced |
Merced |
CA |
NaN |
... |
NaN |
NaN |
NaN |
15623.5 |
7529 |
Vantage College |
El Paso |
TX |
NaN |
... |
NaN |
NaN |
NaN |
9500 |
7530 |
SAE Institute of Technology San Francisco |
Emeryville |
CA |
NaN |
... |
NaN |
NaN |
NaN |
9500 |
7533 |
Bay Area Medical Academy - San Jose Satellite ... |
San Jose |
CA |
NaN |
... |
NaN |
NaN |
NaN |
PrivacySuppressed |
7534 |
Excel Learning Center-San Antonio South |
San Antonio |
TX |
NaN |
... |
NaN |
NaN |
NaN |
12125 |
1245 rows × 27 columns
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
0 |
Alabama A & M University |
Normal |
AL |
1.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
1 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
2 |
Amridge University |
Montgomery |
AL |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
3 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
4 |
Alabama State University |
Montgomery |
AL |
1.0 |
... |
0.7554 |
0.1270 |
26600 |
33118.5 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
7530 |
SAE Institute of Technology San Francisco |
Emeryville |
CA |
NaN |
... |
NaN |
NaN |
NaN |
9500 |
7531 |
Rasmussen College - Overland Park |
Overland Park |
KS |
NaN |
... |
NaN |
NaN |
NaN |
21163 |
7532 |
National Personal Training Institute of Cleveland |
Highland Heights |
OH |
NaN |
... |
NaN |
NaN |
NaN |
6333 |
7533 |
Bay Area Medical Academy - San Jose Satellite ... |
San Jose |
CA |
NaN |
... |
NaN |
NaN |
NaN |
PrivacySuppressed |
7534 |
Excel Learning Center-San Antonio South |
San Antonio |
TX |
NaN |
... |
NaN |
NaN |
NaN |
12125 |
7535 rows × 27 columns
16.10 分组对象的 apply方法的使用
# 注意这里apply和DataFrame中的apply方法不一样,这里传入回调函数的是一个分组的DataFrame
# 你需要返回的是一个Series
# 求每个分组各个字段的平均值
collage_data.groupby("STABBR").apply(lambda df:df.mean()).head(5)
|
HBCU |
MENONLY |
WOMENONLY |
RELAFFIL |
... |
CURROPER |
PCTPELL |
PCTFLOAN |
UG25ABV |
---|
STABBR |
|
|
|
|
|
|
|
|
|
---|
AK |
0.000000 |
0.0 |
0.000000 |
0.300000 |
... |
1.000000 |
0.394530 |
0.381660 |
0.506240 |
AL |
0.166667 |
0.0 |
0.011111 |
0.250000 |
... |
0.937500 |
0.603621 |
0.509734 |
0.387039 |
AR |
0.048780 |
0.0 |
0.000000 |
0.209302 |
... |
0.965116 |
0.581470 |
0.505556 |
0.356059 |
AS |
0.000000 |
0.0 |
0.000000 |
0.000000 |
... |
1.000000 |
0.724500 |
0.000000 |
0.177400 |
AZ |
0.000000 |
0.0 |
0.000000 |
0.067669 |
... |
0.879699 |
0.549792 |
0.543702 |
0.480859 |
5 rows × 22 columns
16.11 用连续变量分组
16.11.1 cut概念
# 被划分成了六个片元(就是六个区间,将这些区间映射到传入的Series中)
my_cut = pd.cut(collage_data["UGDS"],bins=[-np.Inf,10000,20000,30000,40000,50000,np.Inf])
my_cut
0 (-inf, 10000.0]
1 (10000.0, 20000.0]
2 (-inf, 10000.0]
3 (-inf, 10000.0]
4 (-inf, 10000.0]
...
7530 NaN
7531 NaN
7532 NaN
7533 NaN
7534 NaN
Name: UGDS, Length: 7535, dtype: category
Categories (6, interval[float64]): [(-inf, 10000.0] < (10000.0, 20000.0] < (20000.0, 30000.0] < (30000.0, 40000.0] < (40000.0, 50000.0] < (50000.0, inf]]
16.11.2 使用cut进行分组
collage_data.groupby(my_cut,as_index=True)[["UGDS","INSTNM"]].max()
|
UGDS |
INSTNM |
---|
UGDS |
|
|
---|
(-inf, 10000.0] |
9999.0 |
eClips School of Cosmetology and Barbering |
(10000.0, 20000.0] |
19979.0 |
Youngstown State University |
(20000.0, 30000.0] |
29977.0 |
West Virginia University |
(30000.0, 40000.0] |
39958.0 |
Valencia College |
(40000.0, 50000.0] |
49340.0 |
Western Governors University |
(50000.0, inf] |
151558.0 |
University of Phoenix-Arizona |