注:本教程为系列教程此章节接前面第一弹

本文向导

15 布尔索引

15.1 导入数据

# 以movie_title为索引列
movies = pd.read_csv("./pandasLearnData/movie.csv",index_col="movie_title")
movies.head(5)
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Avatar Color James Cameron 723.0 178.0 ... 936.0 7.9 1.78 33000
Pirates of the Caribbean: At World's End Color Gore Verbinski 302.0 169.0 ... 5000.0 7.1 2.35 0
Spectre Color Sam Mendes 602.0 148.0 ... 393.0 6.8 2.35 85000
The Dark Knight Rises Color Christopher Nolan 813.0 164.0 ... 23000.0 8.5 2.35 164000
Star Wars: Episode VII - The Force Awakens NaN Doug Walker NaN NaN ... 12.0 7.1 NaN 0

5 rows × 27 columns

15.2 构建布尔索引

# 找出时长在两个小时以上的电影
movie_2_hours = movies.duration > 120
print("类别:",type(movie_2_hours))
print("<"+"="*75+">")
print(movie_2_hours)
# 可以看见经过比较运算符后,我们得到了一个以movie_title为索引,值为布尔类型的Series
# 这就是布尔索引
类别: <class 'pandas.core.series.Series'>
<===========================================================================>
movie_title
Avatar                                       True
Pirates of the Caribbean: At World's End     True
Spectre                                      True
The Dark Knight Rises                        True
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                            False
My Date with Drew                           False
Name: duration, Length: 4916, dtype: bool

15.3 统计布尔值

# 统计电影时长超过两小时的电影数量
movie_2_hours.sum()
1039
# 统计时长超过两个小时的电影所占比例
movie_2_hours.mean()
0.2113506916192026
# 因为原来duration字段原来有空缺值
print("duration字段有是否空缺值:",movies.duration.notnull().any())
# 所以先去掉空缺值再计算
movies.duration.dropna().gt(120).mean()
duration字段有是否空缺值: True





0.21199755152009794

15.4 比较同一个DataFrame中的两列

# 电影中女二号脸书粉丝数大于女二号的电影
movies.actor_1_facebook_likes < movies.actor_2_facebook_likes
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End    False
Spectre                                     False
The Dark Knight Rises                       False
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                            False
My Date with Drew                           False
Length: 4916, dtype: bool

15.5 any() 和 all() 的区别

# all() 全真时才为真
print(pd.Series([True,True,True]).all())
# all() 有一个假时即返回假
print(pd.Series([True,False,True]).all())
True
False
# any() 全假时才为假
print(pd.Series([False,False,False]).any())
# any() 有一个真时即返回真
print(pd.Series([False,False,True]).any())
False
True

15.6 多个布尔索引的逻辑运算

!!!注意逻辑运算符的优先级高于比较运算符所以要打括号!!!

15.6.1 与运算

# 电影时长大于两个小时且女一号的粉丝数大于1000
(movies["duration"] > 120) & (movies["actor_1_facebook_likes"] > 1000)
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End     True
Spectre                                      True
The Dark Knight Rises                        True
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                            False
My Date with Drew                           False
Length: 4916, dtype: bool

15.6.2 或运算

# 电影时长大于120 或者 小于 100的电影
(movies["duration"] < 100) | (movies["duration"] > 120)
movie_title
Avatar                                       True
Pirates of the Caribbean: At World's End     True
Spectre                                      True
The Dark Knight Rises                        True
                                            ...  
The Following                                True
A Plague So Pleasant                         True
Shanghai Calling                            False
My Date with Drew                            True
Name: duration, Length: 4916, dtype: bool

15.6.3 非运算

# 找出时长不大于120的电影
~ (movies["duration"] > 120)
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End    False
Spectre                                     False
The Dark Knight Rises                       False
                                            ...  
The Following                                True
A Plague So Pleasant                         True
Shanghai Calling                             True
My Date with Drew                            True
Name: duration, Length: 4916, dtype: bool

15.6.4 in运算

# 找出电影时长在集合中的电影
movies["duration"].isin([90.0,100.0,120.0])
movie_title
Avatar                                      False
Pirates of the Caribbean: At World's End    False
Spectre                                     False
The Dark Knight Rises                       False
                                            ...  
The Following                               False
A Plague So Pleasant                        False
Shanghai Calling                             True
My Date with Drew                            True
Name: duration, Length: 4916, dtype: bool

15.7 使用布尔索引获取数据

# 取数据集当中电影时长大于两个小时的行
movies[movie_2_hours]
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Avatar Color James Cameron 723.0 178.0 ... 936.0 7.9 1.78 33000
Pirates of the Caribbean: At World's End Color Gore Verbinski 302.0 169.0 ... 5000.0 7.1 2.35 0
Spectre Color Sam Mendes 602.0 148.0 ... 393.0 6.8 2.35 85000
The Dark Knight Rises Color Christopher Nolan 813.0 164.0 ... 23000.0 8.5 2.35 164000
... ... ... ... ... ... ... ... ... ...
Intolerance: Love's Struggle Throughout the Ages Black and White D.W. Griffith 69.0 123.0 ... 22.0 8.0 1.33 691
The Big Parade Black and White King Vidor 48.0 151.0 ... 12.0 8.3 1.33 226
Ordet Black and White Carl Theodor Dreyer 54.0 126.0 ... 0.0 8.1 1.37 863
The Ridges NaN Brandon Landers NaN 143.0 ... 19.0 3.0 NaN 33

1039 rows × 27 columns

# 获取女一号脸书粉丝数大于1万且电影时长超过两小时的行
movies[(movies["actor_1_facebook_likes"] > 10000) & movie_2_hours]
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Pirates of the Caribbean: At World's End Color Gore Verbinski 302.0 169.0 ... 5000.0 7.1 2.35 0
Spectre Color Sam Mendes 602.0 148.0 ... 393.0 6.8 2.35 85000
The Dark Knight Rises Color Christopher Nolan 813.0 164.0 ... 23000.0 8.5 2.35 164000
Spider-Man 3 Color Sam Raimi 392.0 156.0 ... 11000.0 6.2 2.35 0
... ... ... ... ... ... ... ... ... ...
That Thing You Do! Color Tom Hanks 75.0 149.0 ... 9000.0 6.9 1.37 0
Stonewall Color Roland Emmerich 74.0 129.0 ... 463.0 4.5 2.35 0
The Good, the Bad and the Ugly Color Sergio Leone 181.0 142.0 ... 34.0 8.9 2.35 20000
Rocky Color John G. Avildsen 141.0 145.0 ... 1000.0 8.1 1.33 0

435 rows × 27 columns

15.8 扩展

15.8.1 使用标签索引代替布尔索引

15.8.1.1 布尔索引法

# 找出所有德克萨斯州的学校信息
college_data[college_data["STABBR"] == "TX"]
CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Abilene Christian University Abilene TX 0.0 0.0 ... 0.5527 0.0381 40200 25985
Alvin Community College Alvin TX 0.0 0.0 ... 0.0625 0.2841 34500 6750
Amarillo College Amarillo TX 0.0 0.0 ... 0.1573 0.3431 31700 10950
Angelina College Lufkin TX 0.0 0.0 ... 0.0000 0.2603 26900 PrivacySuppressed
... ... ... ... ... ... ... ... ... ...
Strayer University-San Antonio San Antonio TX NaN NaN ... NaN NaN NaN 36173.5
Strayer University-Stafford Stafford TX NaN NaN ... NaN NaN NaN 36173.5
Vantage College El Paso TX NaN NaN ... NaN NaN NaN 9500
Excel Learning Center-San Antonio South San Antonio TX NaN NaN ... NaN NaN NaN 12125

472 rows × 26 columns

15.8.1.2 标签索引法

# 先设置 STABBR 字段为行索引
college_data2 = college_data.set_index("STABBR")
college_data2.loc["TX"]
CITY HBCU MENONLY WOMENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
STABBR
TX Abilene 0.0 0.0 0.0 ... 0.5527 0.0381 40200 25985
TX Alvin 0.0 0.0 0.0 ... 0.0625 0.2841 34500 6750
TX Amarillo 0.0 0.0 0.0 ... 0.1573 0.3431 31700 10950
TX Lufkin 0.0 0.0 0.0 ... 0.0000 0.2603 26900 PrivacySuppressed
... ... ... ... ... ... ... ... ... ...
TX San Antonio NaN NaN NaN ... NaN NaN NaN 36173.5
TX Stafford NaN NaN NaN ... NaN NaN NaN 36173.5
TX El Paso NaN NaN NaN ... NaN NaN NaN 9500
TX San Antonio NaN NaN NaN ... NaN NaN NaN 12125

472 rows × 25 columns

15.8.1.3 比较两种方法的速度

%timeit college_data[college_data["STABBR"] == "TX"]
%timeit college_data2.loc["TX"]
# 可以看到标签索引法相较于布尔索引法在时间上差了整整两个数量级
1.12 ms ± 84.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
559 µs ± 52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

15.8.2 使用有序索引进行获取

15.8.2.1 排序前

# 没排序之前查看索引是否有序
college_data2.index.is_monotonic
False

15.8.2.2 排序后

college_data3 = college_data2.sort_index(ascending=True)

15.8.2.3 排序前后的速度

%timeit college_data2.loc["TX"]
%timeit college_data3.loc["TX"]
# 克见排序后的获取数据的时间快了一倍
570 µs ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
179 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

15.8.3 使用唯一索引获取数据

15.8.3.1 判断索引是否唯一

college_data.index.is_unique
True

15.8.3.2 再比较两种的速度

%timeit college_data.loc["Stanford University"]
college_data4 = college_data.sort_index()
%timeit college_data4.loc["Stanford University"]
157 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
159 µs ± 855 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

15.8.4 使用query方法提高可读性

c = "Color"
movies.query("100 <= duration <= 120 and actor_1_facebook_likes > 10000 and color==@c")
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
The Golden Compass Color Chris Weitz 251.0 113.0 ... 6000.0 6.1 2.35 0
Alice in Wonderland Color Tim Burton 451.0 108.0 ... 25000.0 6.5 1.85 24000
X-Men: The Last Stand Color Brett Ratner 334.0 104.0 ... 808.0 6.8 2.35 0
Monsters University Color Dan Scanlon 376.0 104.0 ... 779.0 7.3 1.85 44000
... ... ... ... ... ... ... ... ... ...
The Slaughter Rule Color Alex Smith 17.0 112.0 ... 1000.0 6.1 2.35 183
Now Is Good Color Ol Parker 48.0 103.0 ... 766.0 7.2 2.35 0
Chasing Amy Color Kevin Smith 147.0 113.0 ... 1000.0 7.3 1.85 0
The Grace Card Color David G. Evans 25.0 101.0 ... 21.0 6.4 NaN 0

543 rows × 27 columns

movies.where(movies.duration < 60)
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Avatar NaN NaN NaN NaN ... NaN NaN NaN NaN
Pirates of the Caribbean: At World's End NaN NaN NaN NaN ... NaN NaN NaN NaN
Spectre NaN NaN NaN NaN ... NaN NaN NaN NaN
The Dark Knight Rises NaN NaN NaN NaN ... NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
The Following Color NaN 43.0 43.0 ... 593.0 7.5 16.0 32000.0
A Plague So Pleasant NaN NaN NaN NaN ... NaN NaN NaN NaN
Shanghai Calling NaN NaN NaN NaN ... NaN NaN NaN NaN
My Date with Drew NaN NaN NaN NaN ... NaN NaN NaN NaN

4916 rows × 27 columns

15.8.5 使用where对Series进行查询

ac1_fb_likes = movies["actor_1_facebook_likes"].dropna()
# 使用were时会返回一个同等大小的Series,不满足条件的值会进行填充,通过other参数进行设置,默认为np.NaN
ac1_fb_likes.where(ac1_fb_likes > 300,other=-1).where(ac1_fb_likes < 10000,other=-1)
movie_title
Avatar                                      1000.0
Pirates of the Caribbean: At World's End      -1.0
Spectre                                       -1.0
The Dark Knight Rises                         -1.0
                                             ...  
The Following                                841.0
A Plague So Pleasant                          -1.0
Shanghai Calling                             946.0
My Date with Drew                             -1.0
Name: actor_1_facebook_likes, Length: 4909, dtype: float64

15.8.6 DataFrame的mask操作

# mask 操作不会真的将数据删除,只是会将满足条件的行的所有字段置为空值
movies.mask(movies.actor_1_facebook_likes <= 1000)
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Avatar NaN NaN NaN NaN ... NaN NaN NaN NaN
Pirates of the Caribbean: At World's End Color Gore Verbinski 302.0 169.0 ... 5000.0 7.1 2.35 0.0
Spectre Color Sam Mendes 602.0 148.0 ... 393.0 6.8 2.35 85000.0
The Dark Knight Rises Color Christopher Nolan 813.0 164.0 ... 23000.0 8.5 2.35 164000.0
... ... ... ... ... ... ... ... ... ...
The Following NaN NaN NaN NaN ... NaN NaN NaN NaN
A Plague So Pleasant NaN NaN NaN NaN ... NaN NaN NaN NaN
Shanghai Calling NaN NaN NaN NaN ... NaN NaN NaN NaN
My Date with Drew NaN NaN NaN NaN ... NaN NaN NaN NaN

4916 rows × 27 columns

# 利用mask操作和dropna可以很灵活的删除想删除的数据
# dropna 的all参数代表要整行都为空才能删除,默认值为any表示只要字段中有一个为空即删除
movies.mask(movies.actor_1_facebook_likes <= 1000).dropna(how="all")
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Pirates of the Caribbean: At World's End Color Gore Verbinski 302.0 169.0 ... 5000.0 7.1 2.35 0.0
Spectre Color Sam Mendes 602.0 148.0 ... 393.0 6.8 2.35 85000.0
The Dark Knight Rises Color Christopher Nolan 813.0 164.0 ... 23000.0 8.5 2.35 164000.0
Spider-Man 3 Color Sam Raimi 392.0 156.0 ... 11000.0 6.2 2.35 0.0
... ... ... ... ... ... ... ... ... ...
Cheap Thrills Color E.L. Katz 193.0 88.0 ... 982.0 6.8 2.35 0.0
Happy Christmas Color Joe Swanberg 65.0 82.0 ... 969.0 5.6 1.85 812.0
Counting Color Jem Cohen 12.0 111.0 ... NaN 6.0 1.78 5.0
Smiling Fish & Goat on Fire Color Kevin Jordan 21.0 90.0 ... 467.0 7.6 1.85 0.0

1966 rows × 27 columns

15.8.7 使用布尔值、整数、标签进行选取

# 根据布尔条件选取 这里使用loc方法
condition = movies.actor_1_facebook_likes < 1000
movies.loc[condition]
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Star Wars: Episode VII - The Force Awakens NaN Doug Walker NaN NaN ... 12.0 7.1 NaN 0
John Carter Color Andrew Stanton 462.0 132.0 ... 632.0 6.6 2.35 24000
Tangled Color Nathan Greno 324.0 100.0 ... 553.0 7.8 1.85 29000
Quantum of Solace Color Marc Forster 403.0 106.0 ... 412.0 6.7 2.35 0
... ... ... ... ... ... ... ... ... ...
The Following Color NaN 43.0 43.0 ... 593.0 7.5 16.00 32000
A Plague So Pleasant Color Benjamin Roberds 13.0 76.0 ... 0.0 6.3 NaN 16
Shanghai Calling Color Daniel Hsia 14.0 100.0 ... 719.0 6.3 2.35 660
My Date with Drew Color Jon Gunn 43.0 90.0 ... 23.0 6.6 1.85 456

2514 rows × 27 columns

# 比较是否和布尔索引方法得出的结果一致
movies.loc[condition].equals(movies[condition])
True
# 使用iloc利用整数进行获取
# 这里就必须要提取Series中的boll值,返回的也就是一个类型为array的bool数组
movies.iloc[condition.values]
color director_name num_critic_for_reviews duration ... actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
movie_title
Star Wars: Episode VII - The Force Awakens NaN Doug Walker NaN NaN ... 12.0 7.1 NaN 0
John Carter Color Andrew Stanton 462.0 132.0 ... 632.0 6.6 2.35 24000
Tangled Color Nathan Greno 324.0 100.0 ... 553.0 7.8 1.85 29000
Quantum of Solace Color Marc Forster 403.0 106.0 ... 412.0 6.7 2.35 0
... ... ... ... ... ... ... ... ... ...
The Following Color NaN 43.0 43.0 ... 593.0 7.5 16.00 32000
A Plague So Pleasant Color Benjamin Roberds 13.0 76.0 ... 0.0 6.3 NaN 16
Shanghai Calling Color Daniel Hsia 14.0 100.0 ... 719.0 6.3 2.35 660
My Date with Drew Color Jon Gunn 43.0 90.0 ... 23.0 6.6 1.85 456

2514 rows × 27 columns

# 利用布尔数组选取相应类型的列
movies.loc[:,movies.dtypes == "object"]
# 同理 使用iloc时要取其 array数组
color director_name actor_2_name genres ... movie_imdb_link language country content_rating
movie_title
Avatar Color James Cameron Joel David Moore Action|Adventure|Fantasy|Sci-Fi ... http://www.imdb.com/title/tt0499549/?ref_=fn_t... English USA PG-13
Pirates of the Caribbean: At World's End Color Gore Verbinski Orlando Bloom Action|Adventure|Fantasy ... http://www.imdb.com/title/tt0449088/?ref_=fn_t... English USA PG-13
Spectre Color Sam Mendes Rory Kinnear Action|Adventure|Thriller ... http://www.imdb.com/title/tt2379713/?ref_=fn_t... English UK PG-13
The Dark Knight Rises Color Christopher Nolan Christian Bale Action|Thriller ... http://www.imdb.com/title/tt1345836/?ref_=fn_t... English USA PG-13
... ... ... ... ... ... ... ... ... ...
The Following Color NaN Valorie Curry Crime|Drama|Mystery|Thriller ... http://www.imdb.com/title/tt2071645/?ref_=fn_t... English USA TV-14
A Plague So Pleasant Color Benjamin Roberds Maxwell Moody Drama|Horror|Thriller ... http://www.imdb.com/title/tt2107644/?ref_=fn_t... English USA NaN
Shanghai Calling Color Daniel Hsia Daniel Henney Comedy|Drama|Romance ... http://www.imdb.com/title/tt2070597/?ref_=fn_t... English USA PG-13
My Date with Drew Color Jon Gunn Brian Herzlinger Documentary ... http://www.imdb.com/title/tt0378407/?ref_=fn_t... English USA PG

4916 rows × 11 columns