Starting my data analysis journey, based on the book Python for Data Analysis (《利用Python进行数据分析》)

Movie data analysis

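We start with a small example: analyzing movie ratings from the MovieLens 1M dataset. Once downloaded, the dataset consists of three tables (ratings, user information, and movie information). Each table can be read into its own pandas.DataFrame object with pandas.read_table: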

import pandas as pd
unames = ['user_id','gender','age','occupation','zip']
users = pd.read_table('ml-1m/users.dat',sep='::',header = None,names=unames,engine='python')
rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table('ml-1m/ratings.dat',sep='::',header = None,names = rnames,engine='python')

knames = ['movie_id','title','genres']
movies = pd.read_table('ml-1m/movies.dat',sep='::',header = None,names = knames,engine='python')
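
The loading step can be tried without the dataset on disk. The following is a minimal sketch that parses two rows in the same "::"-delimited layout from an in-memory string; the names sample and users_sample are illustrative only, and passing dtype for the zip column is an optional extra that is not part of the original code. It keeps leading zeros in ZIP codes even when every value in the column looks numeric.

```python
import io
import pandas as pd

# Two rows in the same "::"-delimited layout as users.dat (illustrative sample).
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is interpreted as a regular
# expression, which requires the Python parsing engine.
users_sample = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python',
                           dtype={'zip': str})  # keep ZIP codes as text
print(users_sample)
```

Without the dtype argument, a column made up only of digit strings would be parsed as integers and "02460" would lose its leading zero.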

Using Python's slice syntax to look at the first few rows of each DataFrame is a quick way to verify that the data loaded correctly:

>>>print(users[:5])
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

>>>print(ratings[:5])
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

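We use the pandas merge function to merge ratings with users first, and then merge movies into the result. pandas infers which columns to use as merge keys from the overlapping column names: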

data = pd.merge(pd.merge(ratings,users),movies)
# iloc selects a row by integer position (the older .ix indexer was removed in pandas 1.0)
print(data.iloc[0])

Output:

user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
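
Relying on inferred keys is convenient, but the keys can also be spelled out. Here is a small sketch on made-up frames (ratings_s, users_s, and movies_s are hypothetical stand-ins for the real tables) suggesting that the nested merge and an explicit on= version give the same result when the only shared columns are the intended keys.

```python
import pandas as pd

# Tiny stand-ins for the three MovieLens tables (made-up values).
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['Movie A (1990)', 'Movie B (1995)']})

# Keys inferred from overlapping column names, as in the text.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys named explicitly.
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')
print(explicit)
```

Naming the keys guards against an accidental join on an unrelated column that happens to share a name.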

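At this point, with a little familiarity with pandas, it is easy to aggregate the ratings by any user or movie attribute.
To compute the mean rating of each movie by gender, we can use the pivot_table method: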

# Compute each movie's mean rating by gender with pivot_table
mean_rating = data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
print(mean_rating[:5])

Output:

gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024

This produces another DataFrame containing the mean ratings, with movie titles as the row labels and gender as the column labels.
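
To make the shape of that result concrete, here is a minimal sketch on a made-up five-row frame (toy is hypothetical data, not MovieLens). It also computes the same table with groupby followed by unstack, which is one way to think about what pivot_table does in this case.

```python
import pandas as pd

# Made-up ratings: two movies, rated by women (F) and men (M).
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell a mean rating.
pivoted = toy.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# Equivalent here: group by both keys, then move gender into the columns.
grouped = toy.groupby(['title', 'gender'])['rating'].mean().unstack()
print(pivoted)
```
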
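Next we want to filter out movies that received fewer than 250 ratings. To do that, group the data by title and use size() to get a Series holding the size of each title's group: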

# Group by title, then use size() to get a Series of per-title rating counts
ratings_by_title = data.groupby('title').size()
print(ratings_by_title[:10])

Output:

title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

Then select the titles that received at least 250 ratings:

active_titles = ratings_by_title.index[ratings_by_title >= 250]
print(active_titles)

Output:

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
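
Then use this index to select the matching rows from mean_rating (label-based selection with .loc replaces the removed .ix indexer):

mean_rating = mean_rating.loc[active_titles]
print(mean_rating)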

Output:

gender                                                     F         M
title                                                                 
'burbs, The (1989)                                  2.793478  2.962085
10 Things I Hate About You (1999)                   3.646552  3.311966
101 Dalmatians (1961)                               3.791444  3.500000
101 Dalmatians (1996)                               3.240000  2.911215
12 Angry Men (1957)                                 4.184397  4.328421
13th Warrior, The (1999)                            3.112000  3.168000
2 Days in the Valley (1996)                         3.488889  3.244813
20,000 Leagues Under the Sea (1954)                 3.670103  3.709205
2001: A Space Odyssey (1968)                        3.825581  4.129738
2010 (1984)                                         3.446809  3.413712
28 Days (2000)                                      3.209424  2.977707
39 Steps, The (1935)                                3.965517  4.107692
54 (1998)                                           2.701754  2.782178
7th Voyage of Sinbad, The (1958)                    3.409091  3.658879
8MM (1999)                                          2.906250  2.850962
About Last Night... (1986)                          3.188679  3.140909
Absent Minded Professor, The (1961)                 3.469388  3.446809
Absolute Power (1997)                               3.469136  3.327759
Abyss, The (1989)                                   3.659236  3.689507
Ace Ventura: Pet Detective (1994)                   3.000000  3.197917
Ace Ventura: When Nature Calls (1995)               2.269663  2.543333
Addams Family Values (1993)                         3.000000  2.878531
Addams Family, The (1991)                           3.186170  3.163498
Adventures in Babysitting (1987)                    3.455782  3.208122
Adventures of Buckaroo Bonzai Across the 8th Di...  3.308511  3.402321
Adventures of Priscilla, Queen of the Desert, T...  3.989071  3.688811
Adventures of Robin Hood, The (1938)                4.166667  3.918367
African Queen, The (1951)                           4.324232  4.223822
Age of Innocence, The (1993)                        3.827068  3.339506
Agnes of God (1985)                                 3.534884  3.244898
...                                                      ...       ...
White Men Can't Jump (1992)                         3.028777  3.231061
Who Framed Roger Rabbit? (1988)                     3.569378  3.713251
Who's Afraid of Virginia Woolf? (1966)              4.029703  4.096939
Whole Nine Yards, The (2000)                        3.296552  3.404814
Wild Bunch, The (1969)                              3.636364  4.128099
Wild Things (1998)                                  3.392000  3.459082
Wild Wild West (1999)                               2.275449  2.131973
William Shakespeare's Romeo and Juliet (1996)       3.532609  3.318644
Willow (1988)                                       3.658683  3.453543
Willy Wonka and the Chocolate Factory (1971)        4.063953  3.789474
Witness (1985)                                      4.115854  3.941504
Wizard of Oz, The (1939)                            4.355030  4.203138
Wolf (1994)                                         3.074074  2.899083
Women on the Verge of a Nervous Breakdown (1988)    3.934307  3.865741
Wonder Boys (2000)                                  4.043796  3.913649
Working Girl (1988)                                 3.606742  3.312500
World Is Not Enough, The (1999)                     3.337500  3.388889
Wrong Trousers, The (1993)                          4.588235  4.478261
Wyatt Earp (1994)                                   3.147059  3.283898
X-Files: Fight the Future, The (1998)               3.489474  3.493797
X-Men (2000)                                        3.682310  3.851702
Year of Living Dangerously (1982)                   3.951220  3.869403
Yellow Submarine (1968)                             3.714286  3.689286
You've Got Mail (1998)                              3.542424  3.275591
Young Frankenstein (1974)                           4.289963  4.239177
Young Guns (1988)                                   3.371795  3.425620
Young Guns II (1990)                                2.934783  2.904025
Young Sherlock Holmes (1985)                        3.514706  3.363344
Zero Effect (1998)                                  3.864407  3.723140
eXistenZ (1999)                                     3.098592  3.289086

[1216 rows x 2 columns]
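
The whole filtering sequence (count per title, build a boolean mask, select by label) can be sketched on a toy frame. The data below is made up, and a threshold of 2 stands in for the 250 used on the real dataset.

```python
import pandas as pd

# Made-up ratings: Movie A has 3, Movie B has 1, Movie C has 2.
toy = pd.DataFrame({
    'title': ['Movie A'] * 3 + ['Movie B'] * 1 + ['Movie C'] * 2,
    'rating': [4, 5, 3, 2, 5, 4],
})

counts = toy.groupby('title').size()            # ratings per title
popular = counts.index[counts >= 2]             # titles that pass the threshold
means = toy.groupby('title')['rating'].mean()   # mean rating per title
filtered = means.loc[popular]                   # label-based selection
print(filtered)
```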

To see which movies female viewers liked most, we can sort by the F column in descending order:

# Sort by the F column; ascending=False means descending order
top_female_ratings = mean_rating.sort_values(by='F',ascending=False)
print(top_female_ratings[:10])

Output:

gender                                                     F         M
title                                                                 
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
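
When only the top few rows are needed, DataFrame.nlargest is an alternative to a full sort followed by a slice. The sketch below uses a made-up three-row table (mean_toy is hypothetical) and is not part of the original analysis.

```python
import pandas as pd

# Made-up mean ratings indexed by title.
mean_toy = pd.DataFrame(
    {'F': [4.6, 3.1, 4.2], 'M': [4.4, 3.5, 4.3]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Full sort, then slice off the first two rows (the approach in the text).
by_sort = mean_toy.sort_values(by='F', ascending=False)[:2]

# Ask directly for the two rows with the largest F values.
by_nlargest = mean_toy.nlargest(2, 'F')
print(by_nlargest)
```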

Measuring rating disagreement

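Suppose we want to find the movies on which male and female viewers disagree the most. One way is to add a column to mean_rating that holds the difference between the two mean ratings, and then sort by it: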

mean_rating['diff'] = mean_rating['M']-mean_rating['F']
sorted_by_diff = mean_rating.sort_values(by='diff')
print(sorted_by_diff[:15])

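Output (diff is M minus F, so the most negative values at the top are the movies that women rated higher than men did):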

gender                                        F         M      diff
title                                                              
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573
French Kiss (1995)                     3.535714  3.056962 -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688 -0.470312
Guys and Dolls (1955)                  4.051724  3.583333 -0.468391
Mary Poppins (1964)                    4.197740  3.730594 -0.467147
Patch Adams (1998)                     3.473282  3.008746 -0.464536

Reversing the order of the rows and again taking the first 15 gives the movies that male viewers preferred:

print(sorted_by_diff[::-1][:15])

Output:

gender                                         F         M      diff
title                                                               
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
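
The sign convention is easy to get backwards, so here is a small sketch on made-up numbers (toy is hypothetical) showing which end of the sorted table corresponds to which audience when diff is defined as M minus F.

```python
import pandas as pd

# Made-up mean ratings by gender.
toy = pd.DataFrame(
    {'F': [3.8, 3.5, 4.0], 'M': [3.0, 4.2, 4.1]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

toy['diff'] = toy['M'] - toy['F']          # negative: women rated it higher
sorted_toy = toy.sort_values(by='diff')    # ascending by default

preferred_by_women = list(sorted_toy.index)       # most negative diff first
preferred_by_men = list(sorted_toy[::-1].index)   # most positive diff first
print(sorted_toy)
```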

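If we simply want the movies that provoke the most disagreement among viewers, regardless of gender, we can measure disagreement by the variance or standard deviation of the ratings: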

# Standard deviation of ratings, grouped by title
ratings_std_by_title = data.groupby('title')['rating'].std()
# Keep only the titles in active_titles
ratings_std_by_title = ratings_std_by_title.loc[active_titles]
# Sort the Series by value in descending order
print(ratings_std_by_title.sort_values(ascending=False)[:10])

Output:

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
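
To see why standard deviation captures disagreement, consider this sketch on made-up ratings (toy is hypothetical). Both movies have similar mean ratings, but one is split between 1s and 5s while the other is rated almost uniformly. Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Movie A: polarized ratings. Movie B: near-unanimous ratings.
toy = pd.DataFrame({
    'title': ['Movie A'] * 4 + ['Movie B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation (ddof=1) of the ratings for each title.
std_by_title = toy.groupby('title')['rating'].std()

# Largest spread first: the most divisive title comes out on top.
most_divisive = std_by_title.sort_values(ascending=False)
print(most_divisive)
```

For Movie B the squared deviations from the mean of 3.25 sum to 0.75; dividing by n - 1 = 3 gives 0.25, whose square root is 0.5. For Movie A the sum is 16, giving sqrt(16/3), roughly 2.31.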