• 任务1:找出最多人想读的50本书的名称

  • 任务2:找出这50本书对应最热门的10个标签

文件1:to_read.csv

  • 每行两个数据,用户 id 和该用户想读的书籍 id

文件2:books.csv

  • 书籍的各类 id,名称,作者等信息

文件3:tags.csv

  • 每行两个数据,标签 id 和标签名称

文件4:book_tags.csv

  • 每行三个数据,_goodreads_book_id_(和 to_read 中的书籍 id 的对应关系可以在 books.csv 里找到),标签 id,标记次数

数据集:

数据集链接:

解答

Python 原生的处理方式,代码如下「简版代码」:

import pandas as pd
import numpy as np


data = pd.read_csv('../to_read.csv')
# print(data)
new_data = data['book_id']
# array_lengt = len(set(data['book_id']))
# print(array_lengt)
# book_count_array = np.zeros(array_lengt)
# print(set(new_data))
book_id_values = {}
result = list(new_data)
for data in set(new_data):
	book_id_values[data] = result.count(data)
# print(book_id_values)
d_sorted_by_value = sorted(book_id_values.items(), key=lambda x: x[1]) # 根据字典值的升序排序
a = d_sorted_by_value[::-1][:50]
print(a)
print(len(a))
[(47, 2772), (143, 1967), (113, 1840), (13, 1812), (11, 1767), (45, 1717), (139, 1650), (39, 1619), (65, 1608), (35, 1576), (342, 1521), (185, 1502), (119, 1499), (8, 1498), (6, 1484), (4, 1478), (94, 1460), (89, 1458), (55, 1441), (61, 1435), (109, 1432), (16, 1425), (31, 1417), (67, 1352), (146, 1342), (54, 1339), (46, 1325), (121, 1313), (5, 1293), (173, 1292), (115, 1285), (68, 1257), (36, 1211), (95, 1208), (167, 1188), (129, 1181), (265, 1180), (137, 1172), (277, 1160), (66, 1158), (267, 1154), (268, 1149), (28, 1148), (38, 1130), (60, 1129), (14, 1127), (225, 1111), (10, 1110), (233, 1106), (252, 1105)]
50
[Finished in 147.9s]

Pandas 代码:

第一步,找到最热的50本书

import pandas as pd
import numpy as np


to_read = pd.read_csv('../to_read.csv')
to_read_counts = to_read['book_id'].value_counts().sort_values(ascending=False)
hottest_50_books_id = to_read_counts[:50].index
hottest_50_books_counts = to_read_counts[:50].values
hottest_50_books = pd.DataFrame({
	'book_id': hottest_50_books_id,
	'to_read_counts': hottest_50_books_counts
	})

"""
涉及到的知识点
1. value_counts(): 	计算重复元素出现的次数「显示形式为:值为索引,次数为值」
2. sort_values(): 	按某一列的数据进行排序,使用 by=列名,来指定。 
					默认是升序排序,可以使用 ascending=False 来反转
"""
print(hottest_50_books)
    book_id  to_read_counts
0        47            2772
1       143            1967
2       113            1840
3        13            1812
4        11            1767
5        45            1717
6       139            1650
7        39            1619
8        65            1608
9        35            1576
10      342            1521
11      185            1502
12      119            1499
13        8            1498
14        6            1484
15        4            1478
16       94            1460
17       89            1458
18       55            1441
19       61            1435
20      109            1432
21       16            1425
22       31            1417
23       67            1352
24      146            1342
25       54            1339
26       46            1325
27      121            1313
28        5            1293
29      173            1292
30      115            1285
31       68            1257
32       36            1211
33       95            1208
34      167            1188
35      129            1181
36      265            1180
37      137            1172
38      277            1160
39       66            1158
40      267            1154
41      268            1149
42       28            1148
43       38            1130
44       60            1129
45       14            1127
46      225            1111
47       10            1110
48      233            1106
49      252            1105
[Finished in 0.6s]

第二步,找到书籍的名称

books = pd.read_csv('../books.csv')
book_id_and_title = books[['book_id', 'goodreads_book_id', 'title']]
hottest_50_books_with_title = pd.merge(
	hottest_50_books, 
	book_id_and_title, 
	how='left')
print(hottest_50_books_with_title)
hottest_50_books_with_title.to_csv('hottest_50_books_with_title.csv')

第三步,找到这50本书对应最热的10个标签

book_tags = pd.read_csv('../book_tags.csv')
book_tags = book_tags[book_tags['_goodreads_book_id_'].isin(hottest_50_books_with_title['goodreads_book_id'])]
del book_tags['_goodreads_book_id_']
hottest_10_tags = book_tags.groupby('tag_id').sum()
hottest_10_tags = hottest_10_tags.sort_values(by='count', ascending=False)[:10]
hottest_10_tags = pd.DataFrame({
	'tag_id': hottest_10_tags.index,
	'count': hottest_10_tags['count']
	})
print(hottest_10_tags['tag_id'])

第四步,找到这10个标签的名称

tags = pd.read_csv('../tags.csv')
hottest_10_tags_with_tag_name = pd.merge(
	hottest_10_tags,
	tags,
	on='tag_id',
	how='left')
print(hottest_10_tags_with_tag_name)
hottest_10_tags_with_tag_name.to_csv('hottest_10_tags_with_tag_name.csv')