注:本教程为系列教程此章节接前面第一弹

本文向导

14 选取数据的子集

14.1 选取Series的行

14.1.1 从DataFrame中获取一列作为Series

city = college_data["CITY"]
print(city)
print("<"+"="*75+">")
print("类型为:",type(city))
INSTNM
Alabama A & M University                                            Normal
University of Alabama at Birmingham                             Birmingham
Amridge University                                              Montgomery
University of Alabama in Huntsville                             Huntsville
                                                                ...       
Rasmussen College - Overland Park                            Overland Park
National Personal Training Institute of Cleveland         Highland Heights
Bay Area Medical Academy - San Jose Satellite Location            San Jose
Excel Learning Center-San Antonio South                        San Antonio
Name: CITY, Length: 7535, dtype: object
<===========================================================================>
类型为: <class 'pandas.core.series.Series'>

14.1.2 iloc 用法

14.1.2.1 传入整数索引选取一个

city.iloc[0]
'Normal'

14.1.2.2 传入整数列表选取一个新的Series

# 当传入列表时发现获取出来的数据结构依然是一个Series
city.iloc[[0,1,2,3]]
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object

14.1.2.3 分片获取

# 获取整数索引 [0,10) 步长为2,这样选出的依然为Series
city[0:10:2]
INSTNM
Alabama A & M University                     Normal
Amridge University                       Montgomery
Alabama State University                 Montgomery
Central Alabama Community College    Alexander City
Auburn University at Montgomery          Montgomery
Name: CITY, dtype: object

14.1.3 loc 用法

传入索引标签选取一个

city["Alabama A & M University"]
'Normal'

14.1.3.1 通过标签列表选取多行

# 这样通过传入的标签索引列表选取多行,返回的依然是Series
city[["Alabama A & M University","Amridge University"]]
INSTNM
Alabama A & M University        Normal
Amridge University          Montgomery
Name: CITY, dtype: object

14.1.3.2 分片选取

# 选取标签索引 [start_target,end_target] 步长为1的行,返回的是Series,注意这里是端点值都能取到
city["Alabama A & M University":"University of Alabama in Huntsville":1]
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Name: CITY, dtype: object

14.2 选取DataFrame的行

14.2.1 iloc用法

14.2.1.1 传入一个整数索引值获取一行数据(返回类型为Series)

college_data.iloc[0]
CITY                  Normal
STABBR                    AL
HBCU                       1
MENONLY                    0
                       ...  
PCTFLOAN              0.8284
UG25ABV               0.1049
MD_EARN_WNE_P10        30300
GRAD_DEBT_MDN_SUPP     33888
Name: Alabama A & M University, Length: 26, dtype: object

14.2.1.2 传入一个整数索引列表,返回多行数据,类型为(DataFrame)

college_data.iloc[[1,3,5,7,9]]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750
Athens State University Athens AL 0.0 0.0 ... 0.6296 0.6410 39000 18595
Auburn University Auburn AL 0.0 0.0 ... 0.3494 0.0415 45700 21831

5 rows × 26 columns

14.2.1.3 分片获取

# 获取索引为 [1,10) 步长为2 中的数据行,返回为DataFrame
college_data.iloc[1:10:2]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750
Athens State University Athens AL 0.0 0.0 ... 0.6296 0.6410 39000 18595
Auburn University Auburn AL 0.0 0.0 ... 0.3494 0.0415 45700 21831

5 rows × 26 columns

14.2.2 loc用法

14.2.2.1 传入一个标签获取一行

# 获取标签索引对应的数据行,返回类型为Series
college_data.loc["University of Alabama at Birmingham"]
CITY                  Birmingham
STABBR                        AL
HBCU                           0
MENONLY                        0
                         ...    
PCTFLOAN                  0.5214
UG25ABV                   0.2422
MD_EARN_WNE_P10            39700
GRAD_DEBT_MDN_SUPP       21941.5
Name: University of Alabama at Birmingham, Length: 26, dtype: object

14.2.2.2 传入一个标签列表获取多行

# 根据传入的标签列表返回相应的数据行,返回类型为DataFrame
college_data.loc[["University of Alabama at Birmingham","The University of Alabama"]]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750

2 rows × 26 columns

14.2.2.3 分片获取

# 获取[start_target,end_tartget] 步长为 1 的数据行,返回为DataFrame
college_data.loc["University of Alabama at Birmingham":"University of Alabama in Huntsville":1]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
Amridge University Montgomery AL 0.0 0.0 ... 0.7795 0.8540 40100 23370
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097

3 rows × 26 columns

14.3 同时选取DataFrame的行和列

14.3.1 获取前n行m列

14.3.1.1 用 iloc 方法实现

# 获取前面两行三列数据
college_data.iloc[:2,:3]
CITY STABBR HBCU
INSTNM
Alabama A & M University Normal AL 1.0
University of Alabama at Birmingham Birmingham AL 0.0

14.3.1.2 用 loc方法 实现

# 获取行索引从[start_target,end_target]的行,和列索引为[start,end]的列
college_data.loc[:"University of Alabama at Birmingham",:"HBCU"]
CITY STABBR HBCU
INSTNM
Alabama A & M University Normal AL 1.0
University of Alabama at Birmingham Birmingham AL 0.0

14.3.2 获取全部行中的前n列

14.3.2.1 用 iloc 方法实现

college_data.iloc[:,:2]
CITY STABBR
INSTNM
Alabama A & M University Normal AL
University of Alabama at Birmingham Birmingham AL
Amridge University Montgomery AL
University of Alabama in Huntsville Huntsville AL
... ... ...
Rasmussen College - Overland Park Overland Park KS
National Personal Training Institute of Cleveland Highland Heights OH
Bay Area Medical Academy - San Jose Satellite Location San Jose CA
Excel Learning Center-San Antonio South San Antonio TX

7535 rows × 2 columns

14.3.2.2 用 loc 方法实现

college_data.loc[:,:"STABBR"]
CITY STABBR
INSTNM
Alabama A & M University Normal AL
University of Alabama at Birmingham Birmingham AL
Amridge University Montgomery AL
University of Alabama in Huntsville Huntsville AL
... ... ...
Rasmussen College - Overland Park Overland Park KS
National Personal Training Institute of Cleveland Highland Heights OH
Bay Area Medical Academy - San Jose Satellite Location San Jose CA
Excel Learning Center-San Antonio South San Antonio TX

7535 rows × 2 columns

14.3.3 选取不连续的行和列

14.3.3.1 用 iloc 方法实现

college_data.iloc[[1,3,5,7],[2,4,6,8]]
HBCU WOMENONLY SATVRMID DISTANCEONLY
INSTNM
University of Alabama at Birmingham 0.0 0.0 570.0 0.0
University of Alabama in Huntsville 0.0 0.0 595.0 0.0
The University of Alabama 0.0 0.0 555.0 0.0
Athens State University 0.0 0.0 NaN 0.0

14.3.3.2 用 loc 方法实现

# 这里实现的需求同上
college_data.loc[["University of Alabama at Birmingham","University of Alabama in Huntsville","The University of Alabama","Athens State University"],
                ["HBCU","WOMENONLY","SATVRMID","DISTANCEONLY"]]
HBCU WOMENONLY SATVRMID DISTANCEONLY
INSTNM
University of Alabama at Birmingham 0.0 0.0 570.0 0.0
University of Alabama in Huntsville 0.0 0.0 595.0 0.0
The University of Alabama 0.0 0.0 555.0 0.0
Athens State University 0.0 0.0 NaN 0.0

14.3.4 选取某一个标量的值

14.3.4.1 用 iloc方法实现

# 选取第四行四列的值
college_data.iloc[3,3]
0.0

14.3.4.2 用 loc 方法实现

# 实现的需求同上
college_data.loc["Athens State University","MENONLY"]
0.0

14.3.4.3 使用 iat 快速获取标量

%timeit college_data.iloc[1000,3]
# 可以看到使用iat方法,时间上大概节约了一半
%timeit college_data.iat[1000,3]
7.95 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.74 µs ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

14.3.4.4 使用at快速获取标量

%timeit college_data.loc["Rasmussen College - Overland Park","CITY"]
# 同样发现使用at方法比loc的速度也快
%timeit college_data.at["Rasmussen College - Overland Park","CITY"]
6.41 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.16 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

14.4 补充

14.4.1 惰性切片

# 同样试用于Series
college_data[2:10:2]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Amridge University Montgomery AL 0.0 0.0 ... 0.7795 0.8540 40100 23370
Alabama State University Montgomery AL 1.0 0.0 ... 0.7554 0.1270 26600 33118.5
Central Alabama Community College Alexander City AL 0.0 0.0 ... 0.3977 0.3153 27500 16127
Auburn University at Montgomery Montgomery AL 0.0 0.0 ... 0.5803 0.2930 35000 21335

4 rows × 26 columns

# 利用标签索引获取
college_data[:"Central Alabama Community College"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Alabama A & M University Normal AL 1.0 0.0 ... 0.8284 0.1049 30300 33888
University of Alabama at Birmingham Birmingham AL 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
Amridge University Montgomery AL 0.0 0.0 ... 0.7795 0.8540 40100 23370
University of Alabama in Huntsville Huntsville AL 0.0 0.0 ... 0.4596 0.2640 45500 24097
Alabama State University Montgomery AL 1.0 0.0 ... 0.7554 0.1270 26600 33118.5
The University of Alabama Tuscaloosa AL 0.0 0.0 ... 0.4010 0.0853 41900 23750
Central Alabama Community College Alexander City AL 0.0 0.0 ... 0.3977 0.3153 27500 16127

7 rows × 26 columns

14.4.2 按照字母分片

# 按照字母分片必须先对标签索引进行排序
college_data.sort_index(ascending=True)["A":"E"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
A & W Healthcare Educators New Orleans LA 0.0 0.0 ... 0.8596 0.6667 NaN 19022.5
A T Still University of Health Sciences Kirksville MO 0.0 0.0 ... NaN NaN 219800 PrivacySuppressed
ABC Beauty Academy Garland TX 0.0 0.0 ... 0.0000 0.8286 NaN PrivacySuppressed
ABC Beauty College Inc Arkadelphia AR 0.0 0.0 ... 1.0000 0.4688 PrivacySuppressed 16500
... ... ... ... ... ... ... ... ... ...
Durham Technical Community College Durham NC 0.0 0.0 ... 0.1796 0.5961 27200 11069.5
Dutchess BOCES-Practical Nursing Program Poughkeepsie NY 0.0 0.0 ... 0.6275 0.5430 36500 9500
Dutchess Community College Poughkeepsie NY 0.0 0.0 ... 0.1936 0.1806 32500 10250
Dyersburg State Community College Dyersburg TN 0.0 0.0 ... 0.2493 0.3097 26800 7475

1900 rows × 26 columns

# 当然还可以反向获取
college_data.sort_index()["E":"F"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
E Q School of Hair Design Council Bluffs IA 0.0 0.0 ... 0.6737 0.1471 18100 7830
ECPI University Virginia Beach VA 0.0 0.0 ... 0.5001 0.6633 37000 20000
ECPI University-Charleston North Charleston SC NaN NaN ... NaN NaN NaN 20000
ECPI University-Charlotte Charlotte NC NaN NaN ... NaN NaN NaN 20000
... ... ... ... ... ... ... ... ... ...
Excelsior College Albany NY 0.0 0.0 ... 0.0800 0.9337 PrivacySuppressed 11010
Expertise Cosmetology Institute Las Vegas NV 0.0 0.0 ... 1.0000 0.4828 PrivacySuppressed 8450
Exposito School of Hair Design Amarillo TX 0.0 0.0 ... 0.6267 0.3966 15100 PrivacySuppressed
Expression College for Digital Arts Emeryville CA 0.0 0.0 ... 0.7736 0.3955 PrivacySuppressed 35662

381 rows × 26 columns

14.4.3 更换索引

college_data.set_index("CITY")
STABBR HBCU MENONLY WOMENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
CITY
Normal AL 1.0 0.0 0.0 ... 0.8284 0.1049 30300 33888
Birmingham AL 0.0 0.0 0.0 ... 0.5214 0.2422 39700 21941.5
Montgomery AL 0.0 0.0 0.0 ... 0.7795 0.8540 40100 23370
Huntsville AL 0.0 0.0 0.0 ... 0.4596 0.2640 45500 24097
... ... ... ... ... ... ... ... ... ...
Overland Park KS NaN NaN NaN ... NaN NaN NaN 21163
Highland Heights OH NaN NaN NaN ... NaN NaN NaN 6333
San Jose CA NaN NaN NaN ... NaN NaN NaN PrivacySuppressed
San Antonio TX NaN NaN NaN ... NaN NaN NaN 12125

7535 rows × 25 columns

14.4.4 复原索引

college_data.reset_index()
INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
3 University of Alabama in Huntsville Huntsville AL 0.0 ... 0.4596 0.2640 45500 24097
... ... ... ... ... ... ... ... ... ...
7531 Rasmussen College - Overland Park Overland Park KS NaN ... NaN NaN NaN 21163
7532 National Personal Training Institute of Cleveland Highland Heights OH NaN ... NaN NaN NaN 6333
7533 Bay Area Medical Academy - San Jose Satellite ... San Jose CA NaN ... NaN NaN NaN PrivacySuppressed
7534 Excel Learning Center-San Antonio South San Antonio TX NaN ... NaN NaN NaN 12125

7535 rows × 27 columns