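0. Theory

The recommender uses item-based collaborative filtering, which has two advantages:
1) People's tastes are fickle, but items do not change over time.
2) There are far fewer items to recommend than there are people, which saves a great deal of computation and leaves room for more sophisticated algorithms.

Approach: build an item-based collaborative filtering system, that is, implement features such as "people who watched this movie also watched ..." and "people who rated this movie highly also rated ... highly". In other words, establish relationships between movies.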
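To make "relationships between movies" concrete, here is a minimal sketch of one way to measure how similar two movies are: the Pearson correlation of their ratings, computed only over the users who rated both. The toy ratings, the movie names A, B, C and the helper `pearson` are made up for illustration and are not part of the MovieLens data used below.

```python
from math import sqrt

# Hypothetical toy data: ratings[user][movie] = score from 1 to 5.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"A": 2, "C": 4},  # u4 has not rated B
}

def pearson(movie_x, movie_y):
    """Pearson correlation over the users who rated both movies."""
    pairs = [(r[movie_x], r[movie_y]) for r in ratings.values()
             if movie_x in r and movie_y in r]
    n = len(pairs)
    if n < 2:
        return None  # not enough shared ratings
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for constant ratings
    return cov / sqrt(var_x * var_y)

sim_ab = pearson("A", "B")  # fans of A tend to like B: positive
sim_ac = pearson("A", "C")  # fans of A tend to dislike C: negative
print(sim_ab, sim_ac)
```

A positive value suggests that B is a reasonable "people who liked A also liked ..." candidate, while a negative value suggests the opposite.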

1. Recommending movies with Python 3

import pandas as pd 
import warnings
warnings.filterwarnings("ignore")

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.data', sep = '\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")
ratings.head()

   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3

m_cols = ['movie_id', 'title']
movies = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")
movies.head()

   movie_id              title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         3  Four Rooms (1995)
3         4  Get Shorty (1995)
4         5     Copycat (1995)

# Join the two tables on movie_id
ratings = pd.merge(movies, ratings, how = 'inner', on = 'movie_id')
ratings.head()

   movie_id             title  user_id  rating
0         1  Toy Story (1995)      308       4
1         1  Toy Story (1995)      287       5
2         1  Toy Story (1995)      148       4
3         1  Toy Story (1995)      280       4
4         1  Toy Story (1995)       66       3

# Pivot into a user-by-movie matrix: one row per user_id, one column per title
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head()

Output: 5 rows × 1664 columns (index: user_id 0 to 4, columns: title). Only the first 10 and last 10 columns are displayed.

First 10 columns: 'Til There Was You (1997); 1-900 (1994); 101 Dalmatians (1996); 12 Angry Men (1957); 187 (1997); 2 Days in the Valley (1996); 20,000 Leagues Under the Sea (1954); 2001: A Space Odyssey (1968); 3 Ninjas: High Noon At Mega Mountain (1998); 39 Steps, The (1935)

Last 10 columns: Yankee Zulu (1994); Year of the Horse (1997); You So Crazy (1994); Young Frankenstein (1974); Young Guns (1988); Young Guns II (1990); Young Poisoner's Handbook, The (1995); Zeus and Roxanne (1997); unknown; Á köldum klaka (Cold Fever) (1994)

Every displayed cell is NaN except the following:

user_id  title                                         rating
1        101 Dalmatians (1996)                         2.0
1        12 Angry Men (1957)                           5.0
1        20,000 Leagues Under the Sea (1954)           3.0
1        2001: A Space Odyssey (1968)                  4.0
1        Young Frankenstein (1974)                     5.0
1        Young Guns (1988)                             3.0
1        unknown                                       4.0
2        3 Ninjas: High Noon At Mega Mountain (1998)   1.0
3        187 (1997)                                    2.0

5 rows × 1664 columns
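As a small illustration of what `pivot_table` does here, the sketch below applies the same call to a made-up table (the DataFrame `toy` and its values are assumptions, not MovieLens data). Missing user/movie combinations become NaN, and if the same user has several rows for the same title, the default aggregation (the mean) collapses them into one cell.

```python
import pandas as pd

# Hypothetical long-format ratings; user 2 has two rows for title "A".
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "title": ["A", "B", "A", "A", "C"],
    "rating": [5, 3, 4, 2, 1],
})

# Same call shape as in the article: users as rows, titles as columns.
matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(matrix)

# User 1 never rated "C", so that cell is NaN.
# User 2 rated "A" twice (4 and 2), so that cell holds their mean, 3.0.
```

This is why most of the cells shown above are NaN: each user has rated only a small fraction of the 1664 titles.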

# Pearson correlation between every pair of movies; min_periods=100 keeps only
# the pairs for which at least 100 users rated both movies
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

Output: 5 rows × 1664 columns (both the index and the columns are title). Only the first 10 and last 10 columns are displayed, and they are the same titles as in the previous output.

Rows shown: 'Til There Was You (1997); 1-900 (1994); 101 Dalmatians (1996); 12 Angry Men (1957); 187 (1997)

Every displayed cell is NaN except the following:

title (row)              title (column)           correlation
101 Dalmatians (1996)    101 Dalmatians (1996)    1.0
12 Angry Men (1957)      12 Angry Men (1957)      1.0

5 rows × 1664 columns
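The effect of `min_periods` can be seen on a tiny example. This is only a sketch on made-up data (the DataFrame `toy` and the thresholds 1 and 3 are assumptions chosen to fit five rows): a pair of columns gets a correlation only if enough rows have values in both columns, otherwise the cell is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix with missing ratings.
toy = pd.DataFrame({
    "A": [5, 4, 1, 2, np.nan],
    "B": [4, 5, 2, np.nan, np.nan],
    "C": [1, np.nan, np.nan, 4, 5],
})

# A and B share 3 rated rows; A and C share only 2.
loose = toy.corr(method="pearson", min_periods=1)
strict = toy.corr(method="pearson", min_periods=3)

print(loose)
print(strict)

# With min_periods=3 the A/C cell becomes NaN because only 2 users rated
# both, while the A/B cell survives because 3 users rated both.
```

With only two shared ratings, A and C look perfectly anti-correlated in `loose`, which is exactly the kind of unreliable value that a higher threshold filters out. The same rule applies to the diagonal, which would explain why movies such as 'Til There Was You (1997) have NaN even against themselves in the output above.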

# Take the row for user_id 0 as the test data
myRatings = userRatings.loc[0].dropna()
myRatings
title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64
myRatings.iloc[0]
5.0
# Recommend movies for the test user
simList = []
for title, rating in myRatings.items():
    print("Finding movies similar to {}...".format(title))
    # Similarities between this movie and every other movie
    sims = corrMatrix[title].dropna()
    # Weight each similarity by the user's rating of this movie
    sims = sims * rating
    # Collect the result
    simList.append(sims)
# Series.append was removed in pandas 2.0, so concatenate once at the end
simCandidates = pd.concat(simList)

print("Results:")
simCandidates.sort_values(inplace = True, ascending = False)
print(simCandidates.head(10))
Finding movies similar to Empire Strikes Back, The (1980)...
Finding movies similar to Gone with the Wind (1939)...
Finding movies similar to Star Wars (1977)...
Results:
Empire Strikes Back, The (1980)                       5.000000
Star Wars (1977)                                      5.000000
Empire Strikes Back, The (1980)                       3.741763
Star Wars (1977)                                      3.741763
Return of the Jedi (1983)                             3.606146
Return of the Jedi (1983)                             3.362779
Raiders of the Lost Ark (1981)                        2.693297
Raiders of the Lost Ark (1981)                        2.680586
Austin Powers: International Man of Mystery (1997)    1.887164
Sting, The (1973)                                     1.837692
dtype: float64
# Sum the scores by movie title
simCandidates = simCandidates.groupby(simCandidates.index).sum()
# Sort by score, highest first
simCandidates.sort_values(inplace = True, ascending = False)
# Drop the movies the user has already rated, since movies they have seen need not be recommended
filteredSims = simCandidates.drop(myRatings.index)
print("Recommendations:")
filteredSims.head(10)
Recommendations:
Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64
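The aggregation step is easier to follow on a small example. In the sketch below the similarity values and the movie names M1, M2, M3 are made up. Each rated movie contributes similarity × rating to its neighbours, and a candidate that is similar to several rated movies appears once per rated movie until `groupby(...).sum()` merges the duplicates, as Return of the Jedi (1983) does in the raw list above.

```python
import pandas as pd

# Hypothetical similarities to two movies the user has rated.
sims_from_x = pd.Series({"M1": 0.8, "M2": 0.5}) * 5.0  # user rated X with 5
sims_from_y = pd.Series({"M1": 0.6, "M3": 0.9}) * 1.0  # user rated Y with 1

# Concatenating keeps duplicate index labels: M1 appears twice.
candidates = pd.concat([sims_from_x, sims_from_y])

# Sum the contributions per movie and sort from best to worst.
totals = candidates.groupby(candidates.index).sum()
totals = totals.sort_values(ascending=False)
print(totals)

# M1 scores 0.8 * 5 + 0.6 * 1 = 4.6, M2 scores 2.5, M3 scores 0.9.
```

Note that even a movie rated 1 still adds a positive contribution to its neighbours under this scheme, which is one of the weaknesses addressed in the next section.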

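2. Improving the recommendations

Try the following to improve the results:
1) Correlation coefficient: use the Spearman correlation coefficient.
2) Change the value of min_period.
3) When the user dislikes a movie, movies similar to it should not be recommended.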
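For the first idea, it may help to see how the two coefficients differ. The sketch below uses made-up numbers (the DataFrame `toy` is an assumption): Pearson measures linear agreement between the raw values, while Spearman measures agreement between their ranks, so it equals the Pearson correlation of the ranked data.

```python
import pandas as pd

# Hypothetical columns: B always increases with A, but not linearly.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 3, 4, 50],
})

pearson = toy.corr(method="pearson").loc["A", "B"]
spearman = toy.corr(method="spearman").loc["A", "B"]

# Spearman is Pearson applied to the ranks of the values.
rank_pearson = toy.rank().corr(method="pearson").loc["A", "B"]

print(pearson, spearman, rank_pearson)
```

Here the outlier 50 pulls Pearson well below 1, while Spearman stays at 1 because the ordering of the two columns is identical. On 1-to-5 star ratings the gap is usually smaller, so whether Spearman actually improves the recommendations has to be judged from the results.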

def recommend_movie(method='pearson', min_period=100, weight=0):
    # Choose the correlation method and the minimum number of shared ratings
    corrMatrix = userRatings.corr(method=method, min_periods=min_period)
    # Take the row for user_id 0 as the test data
    myRatings = userRatings.loc[0].dropna()
    # Recommend movies for the test user
    simList = []
    for title, rating in myRatings.items():
        # Similarities between this movie and every other movie
        sims = corrMatrix[title].dropna()
        if rating == 1:
            # The user dislikes this movie, so movies similar to it are
            # penalized; weight=0 ignores them instead of penalizing them
            sims = sims * rating * weight * (-1)
        else:
            # Weight each similarity by the user's rating of this movie
            sims = sims * rating
        # Collect the result
        simList.append(sims)
    simCandidates = pd.concat(simList, axis = 0)

    # Sum the scores by movie title
    simCandidates = simCandidates.groupby(simCandidates.index).sum()
    # Sort by score, highest first
    simCandidates.sort_values(inplace = True, ascending = False)
    # Drop the movies the user has already rated, since movies they have seen need not be recommended
    simCandidates = simCandidates.drop(myRatings.index)
    print("Recommendations for user 0:")
    print(simCandidates.head(10))
recommend_movie(method='spearman', min_period = 100, weight = 0)
Recommendations for user 0:
Return of the Jedi (1983)                    6.407001
Raiders of the Lost Ark (1981)               4.528739
Indiana Jones and the Last Crusade (1989)    3.299785
Sting, The (1973)                            3.273064
Batman (1989)                                3.012647
Singin' in the Rain (1952)                   2.952571
Field of Dreams (1989)                       2.945751
Dumbo (1941)                                 2.894872
Jaws (1975)                                  2.867804
Star Trek: The Wrath of Khan (1982)          2.859166
dtype: float64
recommend_movie(method='pearson', min_period = 80, weight= 0)
Recommendations for user 0:
Return of the Jedi (1983)                    6.968925
Raiders of the Lost Ark (1981)               5.373883
Bridge on the River Kwai, The (1957)         3.366616
Indiana Jones and the Last Crusade (1989)    3.316717
Cinderella (1950)                            3.245412
Sting, The (1973)                            3.209627
Con Air (1997)                               3.204525
Back to the Future (1985)                    3.100622
Day the Earth Stood Still, The (1951)        3.087913
Field of Dreams (1989)                       3.068508
dtype: float64
recommend_movie(method='pearson', min_period = 100, weight= 0.5)
Recommendations for user 0:
Return of the Jedi (1983)                    6.864302
Raiders of the Lost Ark (1981)               5.300974
Bridge on the River Kwai, The (1957)         3.366616
Cinderella (1950)                            3.245412
Indiana Jones and the Last Crusade (1989)    3.231061
Sting, The (1973)                            3.149519
Field of Dreams (1989)                       2.991607
Dumbo (1941)                                 2.981645
Back to the Future (1985)                    2.971963
Star Trek: The Wrath of Khan (1982)          2.968080
dtype: float64

3. References

Python数据科学与机器学习:从入门到实践
Author: Frank Kane (USA)