注:本教程为系列教程此章节接前面第一弹
14 选取数据的子集
14.1 选取Series的行
14.1.1 从DataFrame中获取一列作为Series
city = college_data["CITY"]
print(city)
print("<"+"="*75+">")
print("类型为:",type(city))
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
...
Rasmussen College - Overland Park Overland Park
National Personal Training Institute of Cleveland Highland Heights
Bay Area Medical Academy - San Jose Satellite Location San Jose
Excel Learning Center-San Antonio South San Antonio
Name: CITY, Length: 7535, dtype: object
<===========================================================================>
类型为: <class 'pandas.core.series.Series'>
14.1.2 iloc 用法
14.1.2.1 传入整数索引选取一个
city.iloc[0]
'Normal'
14.1.2.2 传入整数列表选取一个新的Series
# 当传入列表时发现获取出来的数据结构依然是一个Series
city.iloc[[0,1,2,3]]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Name: CITY, dtype: object
14.1.2.3 分片获取
# 获取整数索引 [0,10) 步长为2,这样选出的依然为Series
city[0:10:2]
INSTNM
Alabama A & M University Normal
Amridge University Montgomery
Alabama State University Montgomery
Central Alabama Community College Alexander City
Auburn University at Montgomery Montgomery
Name: CITY, dtype: object
14.1.3 loc 用法
传入索引标签选取一个
city["Alabama A & M University"]
'Normal'
14.1.3.1 通过标签列表选取多行
# 这样通过传入的标签索引列表选取多行,返回的依然是Series
city[["Alabama A & M University","Amridge University"]]
INSTNM
Alabama A & M University Normal
Amridge University Montgomery
Name: CITY, dtype: object
14.1.3.2 分片选取
# 选取标签索引 [start_target,end_target] 步长为1的行,返回的是Series,注意这里是端点值都能取到
city["Alabama A & M University":"University of Alabama in Huntsville":1]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Name: CITY, dtype: object
14.2 选取DataFrame的行
14.2.1 iloc用法
14.2.1.1 传入一个整数索引值获取一行数据(返回类型为Series)
college_data.iloc[0]
CITY Normal
STABBR AL
HBCU 1
MENONLY 0
...
PCTFLOAN 0.8284
UG25ABV 0.1049
MD_EARN_WNE_P10 30300
GRAD_DEBT_MDN_SUPP 33888
Name: Alabama A & M University, Length: 26, dtype: object
14.2.1.2 传入一个整数索引列表,返回多行数据,类型为(DataFrame)
college_data.iloc[[1,3,5,7,9]]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
The University of Alabama |
Tuscaloosa |
AL |
0.0 |
0.0 |
... |
0.4010 |
0.0853 |
41900 |
23750 |
Athens State University |
Athens |
AL |
0.0 |
0.0 |
... |
0.6296 |
0.6410 |
39000 |
18595 |
Auburn University |
Auburn |
AL |
0.0 |
0.0 |
... |
0.3494 |
0.0415 |
45700 |
21831 |
5 rows × 26 columns
14.2.1.3 分片获取
# 获取索引为 [1,10) 步长为2 中的数据行,返回为DataFrame
college_data.iloc[1:10:2]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
The University of Alabama |
Tuscaloosa |
AL |
0.0 |
0.0 |
... |
0.4010 |
0.0853 |
41900 |
23750 |
Athens State University |
Athens |
AL |
0.0 |
0.0 |
... |
0.6296 |
0.6410 |
39000 |
18595 |
Auburn University |
Auburn |
AL |
0.0 |
0.0 |
... |
0.3494 |
0.0415 |
45700 |
21831 |
5 rows × 26 columns
14.2.2 loc用法
14.2.2.1 传入一个标签获取一行
# 获取标签索引对应的数据行,返回类型为Series
college_data.loc["University of Alabama at Birmingham"]
CITY Birmingham
STABBR AL
HBCU 0
MENONLY 0
...
PCTFLOAN 0.5214
UG25ABV 0.2422
MD_EARN_WNE_P10 39700
GRAD_DEBT_MDN_SUPP 21941.5
Name: University of Alabama at Birmingham, Length: 26, dtype: object
14.2.2.2 传入一个标签列表获取多行
# 根据传入的标签列表返回相应的数据行,返回类型为DataFrame
college_data.loc[["University of Alabama at Birmingham","The University of Alabama"]]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
The University of Alabama |
Tuscaloosa |
AL |
0.0 |
0.0 |
... |
0.4010 |
0.0853 |
41900 |
23750 |
2 rows × 26 columns
14.2.2.3 分片获取
# 获取[start_target,end_tartget] 步长为 1 的数据行,返回为DataFrame
college_data.loc["University of Alabama at Birmingham":"University of Alabama in Huntsville":1]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
Amridge University |
Montgomery |
AL |
0.0 |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
3 rows × 26 columns
14.3 同时选取DataFrame的行和列
14.3.1 获取前n行m列
14.3.1.1 用 iloc 方法实现
# 获取前面两行三列数据
college_data.iloc[:2,:3]
|
CITY |
STABBR |
HBCU |
---|
INSTNM |
|
|
|
---|
Alabama A & M University |
Normal |
AL |
1.0 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
14.3.1.2 用 loc方法 实现
# 获取行索引从[start_target,end_target]的行,和列索引为[start,end]的列
college_data.loc[:"University of Alabama at Birmingham",:"HBCU"]
|
CITY |
STABBR |
HBCU |
---|
INSTNM |
|
|
|
---|
Alabama A & M University |
Normal |
AL |
1.0 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
14.3.2 获取全部行中的前n列
14.3.2.1 用 iloc 方法实现
college_data.iloc[:,:2]
|
CITY |
STABBR |
---|
INSTNM |
|
|
---|
Alabama A & M University |
Normal |
AL |
University of Alabama at Birmingham |
Birmingham |
AL |
Amridge University |
Montgomery |
AL |
University of Alabama in Huntsville |
Huntsville |
AL |
... |
... |
... |
Rasmussen College - Overland Park |
Overland Park |
KS |
National Personal Training Institute of Cleveland |
Highland Heights |
OH |
Bay Area Medical Academy - San Jose Satellite Location |
San Jose |
CA |
Excel Learning Center-San Antonio South |
San Antonio |
TX |
7535 rows × 2 columns
14.3.2.2 用 loc 方法实现
college_data.loc[:,:"STABBR"]
|
CITY |
STABBR |
---|
INSTNM |
|
|
---|
Alabama A & M University |
Normal |
AL |
University of Alabama at Birmingham |
Birmingham |
AL |
Amridge University |
Montgomery |
AL |
University of Alabama in Huntsville |
Huntsville |
AL |
... |
... |
... |
Rasmussen College - Overland Park |
Overland Park |
KS |
National Personal Training Institute of Cleveland |
Highland Heights |
OH |
Bay Area Medical Academy - San Jose Satellite Location |
San Jose |
CA |
Excel Learning Center-San Antonio South |
San Antonio |
TX |
7535 rows × 2 columns
14.3.3 选取不连续的行和列
14.3.3.1 用 iloc 方法实现
college_data.iloc[[1,3,5,7],[2,4,6,8]]
|
HBCU |
WOMENONLY |
SATVRMID |
DISTANCEONLY |
---|
INSTNM |
|
|
|
|
---|
University of Alabama at Birmingham |
0.0 |
0.0 |
570.0 |
0.0 |
University of Alabama in Huntsville |
0.0 |
0.0 |
595.0 |
0.0 |
The University of Alabama |
0.0 |
0.0 |
555.0 |
0.0 |
Athens State University |
0.0 |
0.0 |
NaN |
0.0 |
14.3.3.2 用 loc 方法实现
# 这里实现的需求同上
college_data.loc[["University of Alabama at Birmingham","University of Alabama in Huntsville","The University of Alabama","Athens State University"],
["HBCU","WOMENONLY","SATVRMID","DISTANCEONLY"]]
|
HBCU |
WOMENONLY |
SATVRMID |
DISTANCEONLY |
---|
INSTNM |
|
|
|
|
---|
University of Alabama at Birmingham |
0.0 |
0.0 |
570.0 |
0.0 |
University of Alabama in Huntsville |
0.0 |
0.0 |
595.0 |
0.0 |
The University of Alabama |
0.0 |
0.0 |
555.0 |
0.0 |
Athens State University |
0.0 |
0.0 |
NaN |
0.0 |
14.3.4 选取某一个标量的值
14.3.4.1 用 iloc方法实现
# 选取第四行四列的值
college_data.iloc[3,3]
0.0
14.3.4.2 用 loc 方法实现
# 实现的需求同上
college_data.loc["Athens State University","MENONLY"]
0.0
14.3.4.3 使用 iat 快速获取标量
%timeit college_data.iloc[1000,3]
# 可以看到使用iat方法,时间上大概节约了一半
%timeit college_data.iat[1000,3]
7.95 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.74 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
14.3.4.4 使用at快速获取标量
%timeit college_data.loc["Rasmussen College - Overland Park","CITY"]
# 同样发现使用at方法比loc的速度也快
%timeit college_data.at["Rasmussen College - Overland Park","CITY"]
6.41 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.16 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
14.4 补充
14.4.1 惰性切片
# 同样试用于Series
college_data[2:10:2]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
Amridge University |
Montgomery |
AL |
0.0 |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
Alabama State University |
Montgomery |
AL |
1.0 |
0.0 |
... |
0.7554 |
0.1270 |
26600 |
33118.5 |
Central Alabama Community College |
Alexander City |
AL |
0.0 |
0.0 |
... |
0.3977 |
0.3153 |
27500 |
16127 |
Auburn University at Montgomery |
Montgomery |
AL |
0.0 |
0.0 |
... |
0.5803 |
0.2930 |
35000 |
21335 |
4 rows × 26 columns
# 利用标签索引获取
college_data[:"Central Alabama Community College"]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
Alabama A & M University |
Normal |
AL |
1.0 |
0.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
Amridge University |
Montgomery |
AL |
0.0 |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
Alabama State University |
Montgomery |
AL |
1.0 |
0.0 |
... |
0.7554 |
0.1270 |
26600 |
33118.5 |
The University of Alabama |
Tuscaloosa |
AL |
0.0 |
0.0 |
... |
0.4010 |
0.0853 |
41900 |
23750 |
Central Alabama Community College |
Alexander City |
AL |
0.0 |
0.0 |
... |
0.3977 |
0.3153 |
27500 |
16127 |
7 rows × 26 columns
14.4.2 按照字母分片
# 按照字母分片必须先对标签索引进行排序
college_data.sort_index(ascending=True)["A":"E"]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
A & W Healthcare Educators |
New Orleans |
LA |
0.0 |
0.0 |
... |
0.8596 |
0.6667 |
NaN |
19022.5 |
A T Still University of Health Sciences |
Kirksville |
MO |
0.0 |
0.0 |
... |
NaN |
NaN |
219800 |
PrivacySuppressed |
ABC Beauty Academy |
Garland |
TX |
0.0 |
0.0 |
... |
0.0000 |
0.8286 |
NaN |
PrivacySuppressed |
ABC Beauty College Inc |
Arkadelphia |
AR |
0.0 |
0.0 |
... |
1.0000 |
0.4688 |
PrivacySuppressed |
16500 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Durham Technical Community College |
Durham |
NC |
0.0 |
0.0 |
... |
0.1796 |
0.5961 |
27200 |
11069.5 |
Dutchess BOCES-Practical Nursing Program |
Poughkeepsie |
NY |
0.0 |
0.0 |
... |
0.6275 |
0.5430 |
36500 |
9500 |
Dutchess Community College |
Poughkeepsie |
NY |
0.0 |
0.0 |
... |
0.1936 |
0.1806 |
32500 |
10250 |
Dyersburg State Community College |
Dyersburg |
TN |
0.0 |
0.0 |
... |
0.2493 |
0.3097 |
26800 |
7475 |
1900 rows × 26 columns
# 当然还可以反向获取
college_data.sort_index()["E":"F"]
|
CITY |
STABBR |
HBCU |
MENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
INSTNM |
|
|
|
|
|
|
|
|
|
---|
E Q School of Hair Design |
Council Bluffs |
IA |
0.0 |
0.0 |
... |
0.6737 |
0.1471 |
18100 |
7830 |
ECPI University |
Virginia Beach |
VA |
0.0 |
0.0 |
... |
0.5001 |
0.6633 |
37000 |
20000 |
ECPI University-Charleston |
North Charleston |
SC |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
20000 |
ECPI University-Charlotte |
Charlotte |
NC |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
20000 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Excelsior College |
Albany |
NY |
0.0 |
0.0 |
... |
0.0800 |
0.9337 |
PrivacySuppressed |
11010 |
Expertise Cosmetology Institute |
Las Vegas |
NV |
0.0 |
0.0 |
... |
1.0000 |
0.4828 |
PrivacySuppressed |
8450 |
Exposito School of Hair Design |
Amarillo |
TX |
0.0 |
0.0 |
... |
0.6267 |
0.3966 |
15100 |
PrivacySuppressed |
Expression College for Digital Arts |
Emeryville |
CA |
0.0 |
0.0 |
... |
0.7736 |
0.3955 |
PrivacySuppressed |
35662 |
381 rows × 26 columns
14.4.3 更换索引
college_data.set_index("CITY")
|
STABBR |
HBCU |
MENONLY |
WOMENONLY |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
CITY |
|
|
|
|
|
|
|
|
|
---|
Normal |
AL |
1.0 |
0.0 |
0.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
Birmingham |
AL |
0.0 |
0.0 |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
Montgomery |
AL |
0.0 |
0.0 |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
Huntsville |
AL |
0.0 |
0.0 |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
Overland Park |
KS |
NaN |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
21163 |
Highland Heights |
OH |
NaN |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
6333 |
San Jose |
CA |
NaN |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
PrivacySuppressed |
San Antonio |
TX |
NaN |
NaN |
NaN |
... |
NaN |
NaN |
NaN |
12125 |
7535 rows × 25 columns
14.4.4 复原索引
college_data.reset_index()
|
INSTNM |
CITY |
STABBR |
HBCU |
... |
PCTFLOAN |
UG25ABV |
MD_EARN_WNE_P10 |
GRAD_DEBT_MDN_SUPP |
---|
0 |
Alabama A & M University |
Normal |
AL |
1.0 |
... |
0.8284 |
0.1049 |
30300 |
33888 |
1 |
University of Alabama at Birmingham |
Birmingham |
AL |
0.0 |
... |
0.5214 |
0.2422 |
39700 |
21941.5 |
2 |
Amridge University |
Montgomery |
AL |
0.0 |
... |
0.7795 |
0.8540 |
40100 |
23370 |
3 |
University of Alabama in Huntsville |
Huntsville |
AL |
0.0 |
... |
0.4596 |
0.2640 |
45500 |
24097 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
7531 |
Rasmussen College - Overland Park |
Overland Park |
KS |
NaN |
... |
NaN |
NaN |
NaN |
21163 |
7532 |
National Personal Training Institute of Cleveland |
Highland Heights |
OH |
NaN |
... |
NaN |
NaN |
NaN |
6333 |
7533 |
Bay Area Medical Academy - San Jose Satellite ... |
San Jose |
CA |
NaN |
... |
NaN |
NaN |
NaN |
PrivacySuppressed |
7534 |
Excel Learning Center-San Antonio South |
San Antonio |
TX |
NaN |
... |
NaN |
NaN |
NaN |
12125 |
7535 rows × 27 columns