注:本教程为系列教程此章节接前面第一弹
跳转到专题地址
9.选取多个DataFrame
9.1 用列表选取多列
# 选取不同演员的FaceBook的粉丝数
movies[["actor_1_facebook_likes","actor_2_facebook_likes","actor_3_facebook_likes"]]
# 错误示例:
# movies["actor_1_facebook_likes","actor_2_facebook_likes","actor_3_facebook_likes"]
|
actor_1_facebook_likes |
actor_2_facebook_likes |
actor_3_facebook_likes |
---|
0 |
1000.0 |
936.0 |
855.0 |
1 |
40000.0 |
5000.0 |
1000.0 |
2 |
11000.0 |
393.0 |
161.0 |
3 |
27000.0 |
23000.0 |
23000.0 |
... |
... |
... |
... |
4912 |
841.0 |
593.0 |
319.0 |
4913 |
0.0 |
0.0 |
0.0 |
4914 |
946.0 |
719.0 |
489.0 |
4915 |
86.0 |
23.0 |
16.0 |
4916 rows × 3 columns
9.2 选取单列
movies[["actor_1_facebook_likes"]]
|
actor_1_facebook_likes |
---|
0 |
1000.0 |
1 |
40000.0 |
2 |
11000.0 |
3 |
27000.0 |
... |
... |
4912 |
841.0 |
4913 |
0.0 |
4914 |
946.0 |
4915 |
86.0 |
4916 rows × 1 columns
!!!注意在这里要区别下面的这种取法!!!
print(type(movies["actor_1_facebook_likes"]))
print(type(movies[["actor_1_facebook_likes"]]))
# 一种取出的是DataFrame 一种取出的是Series
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
9.3 通过类型选择DataFrame
# 这样就会筛选出字段的类型为整形和浮点型的列(“number”表示选取所有数值列)
movies.select_dtypes(include=["int","float"])
|
num_critic_for_reviews |
duration |
director_facebook_likes |
actor_3_facebook_likes |
... |
actor_2_facebook_likes |
imdb_score |
aspect_ratio |
movie_facebook_likes |
---|
0 |
723.0 |
178.0 |
0.0 |
855.0 |
... |
936.0 |
7.9 |
1.78 |
33000 |
1 |
302.0 |
169.0 |
563.0 |
1000.0 |
... |
5000.0 |
7.1 |
2.35 |
0 |
2 |
602.0 |
148.0 |
0.0 |
161.0 |
... |
393.0 |
6.8 |
2.35 |
85000 |
3 |
813.0 |
164.0 |
22000.0 |
23000.0 |
... |
23000.0 |
8.5 |
2.35 |
164000 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
4912 |
43.0 |
43.0 |
NaN |
319.0 |
... |
593.0 |
7.5 |
16.00 |
32000 |
4913 |
13.0 |
76.0 |
0.0 |
0.0 |
... |
0.0 |
6.3 |
NaN |
16 |
4914 |
14.0 |
100.0 |
0.0 |
489.0 |
... |
719.0 |
6.3 |
2.35 |
660 |
4915 |
43.0 |
90.0 |
16.0 |
16.0 |
... |
23.0 |
6.6 |
1.85 |
456 |
4916 rows × 16 columns
9.4 通过过滤器选取DataFrame
9.4.1 like 过滤法
movies.filter(like="actor_1")
|
actor_1_facebook_likes |
actor_1_name |
---|
0 |
1000.0 |
CCH Pounder |
1 |
40000.0 |
Johnny Depp |
2 |
11000.0 |
Christoph Waltz |
3 |
27000.0 |
Tom Hardy |
... |
... |
... |
4912 |
841.0 |
Natalie Zea |
4913 |
0.0 |
Eva Boehnke |
4914 |
946.0 |
Alan Ruck |
4915 |
86.0 |
John August |
4916 rows × 2 columns
9.4.2 items 选取
movies.filter(items=["actor_1_name","actor_1_facebook_likes"])
|
actor_1_name |
actor_1_facebook_likes |
---|
0 |
CCH Pounder |
1000.0 |
1 |
Johnny Depp |
40000.0 |
2 |
Christoph Waltz |
11000.0 |
3 |
Tom Hardy |
27000.0 |
... |
... |
... |
4912 |
Natalie Zea |
841.0 |
4913 |
Eva Boehnke |
0.0 |
4914 |
Alan Ruck |
946.0 |
4915 |
John August |
86.0 |
4916 rows × 2 columns
9.4.3 正则选取
# 选取不是以actor开头的字段
movies.filter(regex="^(?!actor)")
|
color |
director_name |
num_critic_for_reviews |
duration |
... |
title_year |
imdb_score |
aspect_ratio |
movie_facebook_likes |
---|
0 |
Color |
James Cameron |
723.0 |
178.0 |
... |
2009.0 |
7.9 |
1.78 |
33000 |
1 |
Color |
Gore Verbinski |
302.0 |
169.0 |
... |
2007.0 |
7.1 |
2.35 |
0 |
2 |
Color |
Sam Mendes |
602.0 |
148.0 |
... |
2015.0 |
6.8 |
2.35 |
85000 |
3 |
Color |
Christopher Nolan |
813.0 |
164.0 |
... |
2012.0 |
8.5 |
2.35 |
164000 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
4912 |
Color |
NaN |
43.0 |
43.0 |
... |
NaN |
7.5 |
16.00 |
32000 |
4913 |
Color |
Benjamin Roberds |
13.0 |
76.0 |
... |
2013.0 |
6.3 |
NaN |
16 |
4914 |
Color |
Daniel Hsia |
14.0 |
100.0 |
... |
2012.0 |
6.3 |
2.35 |
660 |
4915 |
Color |
Jon Gunn |
43.0 |
90.0 |
... |
2004.0 |
6.6 |
1.85 |
456 |
4916 rows × 22 columns
10.按照给定的列顺序排序
disc_core = ['movie_title','title_year', 'content_rating','genres']
disc_people = ['director_name','actor_1_name', 'actor_2_name','actor_3_name']
disc_other = ['color','country','language','plot_keywords','movie_imdb_link']
cont_fb = ['director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']
cont_finance = ['budget','gross']
cont_num_reviews = ['num_voted_users','num_user_for_reviews', 'num_critic_for_reviews']
cont_other = ['imdb_score','duration', 'aspect_ratio','facenumber_in_poster']
new_col_order = disc_core + disc_people + disc_other + cont_fb + cont_finance + cont_num_reviews + cont_other
# 我们把所有的列名按照自己的顺序进行排列,最后查看和原来的所有列是否一致
print(set(new_col_order) == set(movies.columns))
# 可以看到列显示的顺序已经按照我们的要求进行显示了
movies[new_col_order]
True
|
movie_title |
title_year |
content_rating |
genres |
... |
imdb_score |
duration |
aspect_ratio |
facenumber_in_poster |
---|
0 |
Avatar |
2009.0 |
PG-13 |
Action|Adventure|Fantasy|Sci-Fi |
... |
7.9 |
178.0 |
1.78 |
0.0 |
1 |
Pirates of the Caribbean: At World's End |
2007.0 |
PG-13 |
Action|Adventure|Fantasy |
... |
7.1 |
169.0 |
2.35 |
0.0 |
2 |
Spectre |
2015.0 |
PG-13 |
Action|Adventure|Thriller |
... |
6.8 |
148.0 |
2.35 |
1.0 |
3 |
The Dark Knight Rises |
2012.0 |
PG-13 |
Action|Thriller |
... |
8.5 |
164.0 |
2.35 |
0.0 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
4912 |
The Following |
NaN |
TV-14 |
Crime|Drama|Mystery|Thriller |
... |
7.5 |
43.0 |
16.00 |
1.0 |
4913 |
A Plague So Pleasant |
2013.0 |
NaN |
Drama|Horror|Thriller |
... |
6.3 |
76.0 |
NaN |
0.0 |
4914 |
Shanghai Calling |
2012.0 |
PG-13 |
Comedy|Drama|Romance |
... |
6.3 |
100.0 |
2.35 |
5.0 |
4915 |
My Date with Drew |
2004.0 |
PG |
Documentary |
... |
6.6 |
90.0 |
1.85 |
0.0 |
4916 rows × 28 columns
11 操作整个DataFrame
11.1 获取DataFrame的行和列数
movies.shape
(4916, 28)
11.2 获取DataFrame的个数
movies.size
137648
11.3 获取DataFrame的长度
len(movies)
4916
11.4 获取DataFrame的维度
movies.ndim
2
11.5 统计DataFrame中各种类型的个数
movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
...
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
11.6 取DataFrame中各列的最小值
# 都是对数值类型的列进行统计
movies.min()
num_critic_for_reviews 1.00
duration 7.00
director_facebook_likes 0.00
actor_3_facebook_likes 0.00
...
actor_2_facebook_likes 0.00
imdb_score 1.60
aspect_ratio 1.18
movie_facebook_likes 0.00
Length: 16, dtype: float64
11.7 取DataFrame中各列的最大值
movies.max()
num_critic_for_reviews 813.0
duration 511.0
director_facebook_likes 23000.0
actor_3_facebook_likes 23000.0
...
actor_2_facebook_likes 137000.0
imdb_score 9.5
aspect_ratio 16.0
movie_facebook_likes 349000.0
Length: 16, dtype: float64
11.8 取DataFrame中各列的平均值
movies.mean()
num_critic_for_reviews 137.988905
duration 107.090798
director_facebook_likes 691.014541
actor_3_facebook_likes 631.276313
...
actor_2_facebook_likes 1621.923516
imdb_score 6.437429
aspect_ratio 2.222349
movie_facebook_likes 7348.294142
Length: 16, dtype: float64
11.9 取DataFrame中各列的中位数
movies.median()
num_critic_for_reviews 108.00
duration 103.00
director_facebook_likes 48.00
actor_3_facebook_likes 366.00
...
actor_2_facebook_likes 593.00
imdb_score 6.60
aspect_ratio 2.35
movie_facebook_likes 159.00
Length: 16, dtype: float64
11.10 取DataFrame中各列的分位数
movies.quantile([0.5,0.3])
|
num_critic_for_reviews |
duration |
director_facebook_likes |
actor_3_facebook_likes |
... |
actor_2_facebook_likes |
imdb_score |
aspect_ratio |
movie_facebook_likes |
---|
0.5 |
108.0 |
103.0 |
48.0 |
366.0 |
... |
593.0 |
6.6 |
2.35 |
159.0 |
0.3 |
60.0 |
95.0 |
11.0 |
176.0 |
... |
345.0 |
6.0 |
1.85 |
0.0 |
2 rows × 16 columns
11.11 取DataFrame中各列的标准差
movies.std()
num_critic_for_reviews 120.239379
duration 25.286015
director_facebook_likes 2832.954125
actor_3_facebook_likes 1625.874802
...
actor_2_facebook_likes 4011.299523
imdb_score 1.127802
aspect_ratio 1.402940
movie_facebook_likes 19206.016458
Length: 16, dtype: float64
11.12 取DataFrame中各列的和
movies.sum()
num_critic_for_reviews 671592.00
duration 524852.00
director_facebook_likes 3326544.00
actor_3_facebook_likes 3088835.00
...
actor_2_facebook_likes 7952291.00
imdb_score 31646.40
aspect_ratio 10200.58
movie_facebook_likes 36124214.00
Length: 16, dtype: float64
补充说明:上面的统计函数可以传入skip = bool 选择是否跳过空值
11.13 获取DataFrame中各列的统计信息
movies.describe()
|
num_critic_for_reviews |
duration |
director_facebook_likes |
actor_3_facebook_likes |
... |
actor_2_facebook_likes |
imdb_score |
aspect_ratio |
movie_facebook_likes |
---|
count |
4867.000000 |
4901.000000 |
4814.000000 |
4893.000000 |
... |
4903.000000 |
4916.000000 |
4590.000000 |
4916.000000 |
mean |
137.988905 |
107.090798 |
691.014541 |
631.276313 |
... |
1621.923516 |
6.437429 |
2.222349 |
7348.294142 |
std |
120.239379 |
25.286015 |
2832.954125 |
1625.874802 |
... |
4011.299523 |
1.127802 |
1.402940 |
19206.016458 |
min |
1.000000 |
7.000000 |
0.000000 |
0.000000 |
... |
0.000000 |
1.600000 |
1.180000 |
0.000000 |
25% |
49.000000 |
93.000000 |
7.000000 |
132.000000 |
... |
277.000000 |
5.800000 |
1.850000 |
0.000000 |
50% |
108.000000 |
103.000000 |
48.000000 |
366.000000 |
... |
593.000000 |
6.600000 |
2.350000 |
159.000000 |
75% |
191.000000 |
118.000000 |
189.750000 |
633.000000 |
... |
912.000000 |
7.200000 |
2.350000 |
2000.000000 |
max |
813.000000 |
511.000000 |
23000.000000 |
23000.000000 |
... |
137000.000000 |
9.500000 |
16.000000 |
349000.000000 |
8 rows × 16 columns
11.14 判断DataFrame中的空值
movies.isnull()
|
color |
director_name |
num_critic_for_reviews |
duration |
... |
actor_2_facebook_likes |
imdb_score |
aspect_ratio |
movie_facebook_likes |
---|
0 |
False |
False |
False |
False |
... |
False |
False |
False |
False |
1 |
False |
False |
False |
False |
... |
False |
False |
False |
False |
2 |
False |
False |
False |
False |
... |
False |
False |
False |
False |
3 |
False |
False |
False |
False |
... |
False |
False |
False |
False |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
4912 |
False |
True |
False |
False |
... |
False |
False |
False |
False |
4913 |
False |
False |
False |
False |
... |
False |
False |
True |
False |
4914 |
False |
False |
False |
False |
... |
False |
False |
False |
False |
4915 |
False |
False |
False |
False |
... |
False |
False |
False |
False |
4916 rows × 28 columns
11.15 判断DataFrame中的非空值
movies.notnull()
|
color |
director_name |
num_critic_for_reviews |
duration |
... |
actor_2_facebook_likes |
imdb_score |
aspect_ratio |
movie_facebook_likes |
---|
0 |
True |
True |
True |
True |
... |
True |
True |
True |
True |
1 |
True |
True |
True |
True |
... |
True |
True |
True |
True |
2 |
True |
True |
True |
True |
... |
True |
True |
True |
True |
3 |
True |
True |
True |
True |
... |
True |
True |
True |
True |
... |
... |
... |
... |
... |
... |
... |
... |
... |
... |
4912 |
True |
False |
True |
True |
... |
True |
True |
True |
True |
4913 |
True |
True |
True |
True |
... |
True |
True |
False |
True |
4914 |
True |
True |
True |
True |
... |
True |
True |
True |
True |
4915 |
True |
True |
True |
True |
... |
True |
True |
True |
True |
4916 rows × 28 columns
11.16 判断DataFrame中的值是否全为真
# 这样便可以查看那些是没有空值的列
movies.notnull().all()
color False
director_name False
num_critic_for_reviews False
duration False
...
actor_2_facebook_likes False
imdb_score True
aspect_ratio False
movie_facebook_likes True
Length: 28, dtype: bool
11.17 在进行对有空值的列进行统计时,填充空值后才能统计
# 必须将空值进行补充后才能统计,不然返回会为空
movies[["director_name"]].fillna("").max()
director_name Étienne Faure
dtype: object
12.DataFrame算数运算
12.1准备数据
matrix = pd.DataFrame([np.random.randint(0,10,3) for i in range(3)])
matrix
|
0 |
1 |
2 |
---|
0 |
2 |
0 |
5 |
1 |
8 |
0 |
6 |
2 |
6 |
5 |
9 |
12.2 加法运算
12.2.1 操作符方式
matrix + 5
|
0 |
1 |
2 |
---|
0 |
7 |
5 |
10 |
1 |
13 |
5 |
11 |
2 |
11 |
10 |
14 |
12.2.2 函数方式
matrix.add(5)
|
0 |
1 |
2 |
---|
0 |
7 |
5 |
10 |
1 |
13 |
5 |
11 |
2 |
11 |
10 |
14 |
12.3 减法运算
12.3.1 操作符方法
matrix - 5
|
0 |
1 |
2 |
---|
0 |
-3 |
-5 |
0 |
1 |
3 |
-5 |
1 |
2 |
1 |
0 |
4 |
12.3.2 函数方法
matrix.sub(5)
|
0 |
1 |
2 |
---|
0 |
-3 |
-5 |
0 |
1 |
3 |
-5 |
1 |
2 |
1 |
0 |
4 |
12.4 乘法运算
12.4.1 操作符方法
matrix * 5
|
0 |
1 |
2 |
---|
0 |
10 |
0 |
25 |
1 |
40 |
0 |
30 |
2 |
30 |
25 |
45 |
12.4.2 函数方法
matrix.mul(5)
|
0 |
1 |
2 |
---|
0 |
10 |
0 |
25 |
1 |
40 |
0 |
30 |
2 |
30 |
25 |
45 |
12.5 除法运算
12.5.1 操作符方法
matrix / 5
|
0 |
1 |
2 |
---|
0 |
0.4 |
0.0 |
1.0 |
1 |
1.6 |
0.0 |
1.2 |
2 |
1.2 |
1.0 |
1.8 |
12.5.2 函数方法
matrix.div(5)
|
0 |
1 |
2 |
---|
0 |
0.4 |
0.0 |
1.0 |
1 |
1.6 |
0.0 |
1.2 |
2 |
1.2 |
1.0 |
1.8 |
12.6 整除运算
12.6.1 操作符方法
matrix // 5
|
0 |
1 |
2 |
---|
0 |
0 |
0 |
1 |
1 |
1 |
0 |
1 |
2 |
1 |
1 |
1 |
12.6.2 函数方法
matrix.floordiv(5)
|
0 |
1 |
2 |
---|
0 |
0 |
0 |
1 |
1 |
1 |
0 |
1 |
2 |
1 |
1 |
1 |
12.7 取模运算
12.7.1 操作符方法
matrix % 5
|
0 |
1 |
2 |
---|
0 |
2 |
0 |
0 |
1 |
3 |
0 |
1 |
2 |
1 |
0 |
4 |
12.7.2 函数方法
matrix.mod(5)
|
0 |
1 |
2 |
---|
0 |
2 |
0 |
0 |
1 |
3 |
0 |
1 |
2 |
1 |
0 |
4 |
13.DataFrame比较运算
13.1 大于
13.1.1 算数方法
matrix > 5
|
0 |
1 |
2 |
---|
0 |
False |
False |
False |
1 |
True |
False |
True |
2 |
True |
False |
True |
13.1.2 函数方法
matrix.gt(5)
|
0 |
1 |
2 |
---|
0 |
False |
False |
False |
1 |
True |
False |
True |
2 |
True |
False |
True |
13.2 大于等于
13.2.1 算数方法
matrix >= 5
|
0 |
1 |
2 |
---|
0 |
False |
False |
True |
1 |
True |
False |
True |
2 |
True |
True |
True |
13.2.2 函数方法
matrix.ge(5)
|
0 |
1 |
2 |
---|
0 |
False |
False |
True |
1 |
True |
False |
True |
2 |
True |
True |
True |
13.3 小于
13.3.1 算数方法
matrix < 5
|
0 |
1 |
2 |
---|
0 |
True |
True |
False |
1 |
False |
True |
False |
2 |
False |
False |
False |
13.3.2 函数方法
matrix.lt(5)
|
0 |
1 |
2 |
---|
0 |
True |
True |
False |
1 |
False |
True |
False |
2 |
False |
False |
False |
13.4 小于等于
13.4.1 算数方法
matrix <= 5
|
0 |
1 |
2 |
---|
0 |
True |
True |
True |
1 |
False |
True |
False |
2 |
False |
True |
False |
13.4.2 函数方法
matrix.le(5)
|
0 |
1 |
2 |
---|
0 |
True |
True |
True |
1 |
False |
True |
False |
2 |
False |
True |
False |
13.5 等于
13.5.1 算数方法
matrix == 5
|
0 |
1 |
2 |
---|
0 |
False |
False |
True |
1 |
False |
False |
False |
2 |
False |
True |
False |
13.5.2 函数方法
matrix.eq(5)
|
0 |
1 |
2 |
---|
0 |
False |
False |
True |
1 |
False |
False |
False |
2 |
False |
True |
False |
13.6 不等于
13.6.1 算数方法
matrix != 5
|
0 |
1 |
2 |
---|
0 |
True |
True |
False |
1 |
True |
True |
True |
2 |
True |
False |
True |
13.6.2 函数方法
matrix.ne(5)
|
0 |
1 |
2 |
---|
0 |
True |
True |
False |
1 |
True |
True |
True |
2 |
True |
False |
True |