Table of Contents
- 3.1.2 Loading the dataset with pandas
- 3.1.3 Cleaning the dataset
- Now compute the actual values for these
- Did the home and visitor teams win their last game?
- 3.2 Decision trees
- 3.2.1 Parameters of decision trees
- 3.2.2 Using decision trees
- 3.3 Predicting sports match outcomes
- 3.4 Random forests
Mind map (figure)
import os
import numpy as np
import pandas as pd
home_folder = "./PythonDataMining/"
data_folder = os.path.join(home_folder,'data')
data_filename = os.path.join(data_folder, "leagues_NBA_2014_games_games.csv")
3.1.2 Loading the dataset with pandas
results = pd.read_csv(data_filename)
results.iloc[:5]
|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|
| 0 | Tue Oct 29 2013 | Box Score | Orlando Magic | 87 | Indiana Pacers | 97 | NaN | NaN |
| 1 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN |
| 2 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN |
| 3 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN |
| 4 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN |
3.1.3 Cleaning the dataset
results = pd.read_csv(data_filename, skiprows=[0,])
# Fix the name of the columns
results.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]
results.iloc[:5]
|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN |
results['HomeWin'] = results['VisitorPts'] < results['HomePts']
y_true = results['HomeWin'].values
results.iloc[:5]
|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True |
print("Home Win 百分比: {0:.1f}%".format(100 * results["HomeWin"].sum() / results["HomeWin"].count()))
results["HomeLastWin"] = False
results["VisitorLastWin"] = False
# This creates two new columns, all set to False
results.iloc[:5]
|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | False | False |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | False | False |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | False | False |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | False | False |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | False | False |
Now compute the actual values for these
Did the home and visitor teams win their last game?
# Now compute the actual values for these
# Did the home and visitor teams win their last game?
from collections import defaultdict
won_last = defaultdict(int)
for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.loc[index] = row
    # Set current win, for the next time each team appears
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results.iloc[20:25]
|    | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin |
|----|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|
| 20 | Fri Nov 1 2013 | Box Score | Miami Heat | 100 | Brooklyn Nets | 101 | NaN | NaN | True | False | False |
| 21 | Fri Nov 1 2013 | Box Score | Cleveland Cavaliers | 84 | Charlotte Bobcats | 90 | NaN | NaN | True | False | True |
| 22 | Fri Nov 1 2013 | Box Score | Portland Trail Blazers | 113 | Denver Nuggets | 98 | NaN | NaN | False | False | False |
| 23 | Fri Nov 1 2013 | Box Score | Dallas Mavericks | 105 | Houston Rockets | 113 | NaN | NaN | True | True | True |
| 24 | Fri Nov 1 2013 | Box Score | San Antonio Spurs | 91 | Los Angeles Lakers | 85 | NaN | NaN | False | False | True |
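Since iterating with iterrows and writing rows back one at a time is slow (as the comment above notes), here is a sketch of an equivalent but faster formulation: it collects the values in plain Python lists and assigns each column once. This is an alternative to the loop above, not the book's code.

from collections import defaultdict

won_last = defaultdict(int)
home_last, visitor_last = [], []
for _, row in results.iterrows():
    home_team, visitor_team = row["Home Team"], row["Visitor Team"]
    # Look up whether each team won its previous game
    home_last.append(bool(won_last[home_team]))
    visitor_last.append(bool(won_last[visitor_team]))
    # Remember this game's outcome for each team's next appearance
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results["HomeLastWin"] = home_last
results["VisitorLastWin"] = visitor_last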
3.2 Decision trees
A decision tree is a supervised machine learning algorithm. It looks like a flowchart made up of a series of nodes, where the value at an upper node determines which node to move to next.
%%html
<img src="./image/决策树1.png" width="100" height="100">
Like most classification algorithms, decision trees work in two major phases.
First is the training phase, in which a tree is built from the training data. The nearest-neighbor algorithm from the previous chapter had no training phase, but decision trees need one. In that sense, nearest neighbor is a lazy algorithm that only does its work when asked to classify. Decision trees, like most machine learning methods, are eager learners that build their model during the training phase.
Second is the prediction phase, in which the trained tree predicts the class of new data. Using the figure above as an example, ["is raining", "very windy"] is predicted as "Bad" (bad weather).
There are several algorithms for creating decision trees, most of which build the tree iteratively. They start at the root node, choose the best feature for the first decision, move to the next node, choose the next best feature there, and so on. The algorithm stops when it determines that no further information can be gained by adding more levels to the tree.
The scikit-learn library implements the Classification and Regression Trees (CART) algorithm as its default decision-tree builder; CART supports both continuous and categorical features.
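As a minimal sketch of the two phases, consider a made-up weather dataset matching the figure's example; the 0/1 encoding of the two features is an assumption made here purely for illustration:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each sample is [is_raining, very_windy], encoded as 0/1
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["Good", "Good", "Bad", "Bad"]

clf = DecisionTreeClassifier()
clf.fit(X, y)                   # training phase: build the tree
print(clf.predict([[1, 1]]))    # prediction phase: prints ['Bad']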
3.2.1 Parameters of decision trees
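The most important parameters control when the tree stops growing, which in turn controls overfitting. A minimal sketch with illustrative values; all of these are standard scikit-learn DecisionTreeClassifier parameters, several of which the grid searches later in this chapter tune:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",    # how split quality is measured: "gini" or "entropy"
    max_depth=5,            # do not grow the tree past this depth
    min_samples_split=10,   # a node must hold this many samples to be split
    min_samples_leaf=4,     # every leaf must keep at least this many samples
    random_state=14,
)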
3.2.2 Using decision trees
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)
from sklearn.model_selection import cross_val_score
X_previouswins = results[['HomeLastWin','VisitorLastWin']].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf,X_previouswins,y_true,scoring = 'accuracy')
print('Using just the last result from the home and visitor teams')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
Using just the last result from the home and visitor teams
Accuracy: 59.1%
3.3 Predicting sports match outcomes
# What about win streaks?
results["HomeWinStreak"] = 0
results["VisitorWinStreak"] = 0
# Track each team's current winning streak
from collections import defaultdict
win_streak = defaultdict(int)

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeWinStreak"] = win_streak[home_team]
    row["VisitorWinStreak"] = win_streak[visitor_team]
    results.loc[index] = row
    # Update the streaks with this game's result
    if row["HomeWin"]:
        win_streak[home_team] += 1
        win_streak[visitor_team] = 0
    else:
        win_streak[home_team] = 0
        win_streak[visitor_team] += 1
clf = DecisionTreeClassifier(random_state=14)
X_winstreak = results[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(clf, X_winstreak, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
Accuracy: 58.4%
Next, let's try which team sits higher on the ladder (the league standings), using the previous season's standings.
ladder_filename = os.path.join(data_folder, "leagues_NBA_2013_standings_expanded-standings.csv")
ladder = pd.read_csv(ladder_filename)
ladder.head()
|   | Rk | Team | Overall | Home | Road | E | W | A | C | SE | ... | Post | ≤3 | ≥10 | Oct | Nov | Dec | Jan | Feb | Mar | Apr |
|---|----|------|---------|------|------|---|---|---|---|----|-----|------|----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | Miami Heat | 66-16 | 37-4 | 29-12 | 41-11 | 25-5 | 14-4 | 12-6 | 15-1 | ... | 30-2 | 9-3 | 39-8 | 1-0 | 10-3 | 10-5 | 8-5 | 12-1 | 17-1 | 8-1 |
| 1 | 2 | Oklahoma City Thunder | 60-22 | 34-7 | 26-15 | 21-9 | 39-13 | 7-3 | 8-2 | 6-4 | ... | 21-8 | 3-6 | 44-6 | NaN | 13-4 | 11-2 | 11-5 | 7-4 | 12-5 | 6-2 |
| 2 | 3 | San Antonio Spurs | 58-24 | 35-6 | 23-18 | 25-5 | 33-19 | 8-2 | 9-1 | 8-2 | ... | 16-12 | 9-5 | 31-10 | 1-0 | 12-4 | 12-4 | 12-3 | 8-3 | 10-4 | 3-6 |
| 3 | 4 | Denver Nuggets | 57-25 | 38-3 | 19-22 | 19-11 | 38-14 | 5-5 | 10-0 | 4-6 | ... | 24-4 | 11-7 | 28-8 | 0-1 | 8-8 | 9-6 | 12-3 | 8-4 | 13-2 | 7-1 |
| 4 | 5 | Los Angeles Clippers | 56-26 | 32-9 | 24-17 | 21-9 | 35-17 | 7-3 | 8-2 | 6-4 | ... | 17-9 | 3-5 | 38-12 | 1-0 | 8-6 | 16-0 | 9-7 | 8-5 | 7-7 | 7-1 |

5 rows × 24 columns
# It seems every feature here gets reduced to just a few categories (e.g. True and False);
# otherwise computing the information gain would take far more work
# We can create a new feature -- HomeTeamRanksHigher
results["HomeTeamRanksHigher"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # The Hornets were renamed the Pelicans after the 2013 season,
    # so map the new name back to the one used in the standings
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = ladder[ladder["Team"] == home_team]["Rk"].values[0]
    visitor_rank = ladder[ladder["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    results.loc[index] = row
results[:5]
|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|---------------|------------------|---------------------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 |
X_homehigher = results[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
准确率: 60.2%
from sklearn.model_selection import GridSearchCV
parameter_space = {
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
}
clf = DecisionTreeClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_homehigher, y_true)
print("准确率: {0:.1f}%".format(grid.best_score_ * 100))
# Who won the last meeting between these two teams? Here we ignore home/visitor status
last_match_winner = defaultdict(int)
results["HomeTeamWonLast"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team]))  # Sort for a consistent ordering
    # Record in the current row which team won the last meeting
    # (defaultdict(int) returns 0 for a first meeting, which never equals a team name)
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    results.loc[index] = row
    # Store the winner of this match for the next meeting
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner
results.loc[:5]
|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher | HomeTeamWonLast |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|---------------|------------------|---------------------|-----------------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Wed Oct 30 2013 | Box Score | Los Angeles Lakers | 94 | Golden State Warriors | 125 | NaN | NaN | True | 0 | True | 0 | 1 | 0 | 0 |
X_home_higher = results[["HomeTeamRanksHigher", "HomeTeamWonLast"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_home_higher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
准确率: 60.5%
Finally, let's see whether a decision tree can produce an effective classification model when the amount of training data grows. We will add the teams themselves as features, to test whether the tree can integrate the new information.
Although decision trees can handle categorical features in principle, the implementation in scikit-learn requires such features to be encoded first. The LabelEncoder transformer converts the string team names into integers. The code follows:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
encoding = LabelEncoder()
encoding.fit(results["Home Team"].values)
home_teams = encoding.transform(results["Home Team"].values)
visitor_teams = encoding.transform(results["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
A decision tree can be trained on these values, but DecisionTreeClassifier still treats them as continuous features. For example, with 17 teams numbered 0 to 16, the algorithm would consider teams 1 and 2 similar and teams 4 and 10 different. That makes no sense: two teams are either the same team or they are not; there is no in-between!
To remove this mismatch with reality, we can use the OneHotEncoder transformer to convert the integers into binary indicator features, one per possible value. For example, if LabelEncoder assigns the Chicago Bulls the value 7, then the seventh one-hot feature is 1 for the Bulls and 0 for every other team. Every possible value is encoded this way, so the dataset becomes much larger. The code follows:
onehot = OneHotEncoder()
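# fit_transform returns a SciPy sparse matrix; .todense() below converts it to a dense matrix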
X_teams = onehot.fit_transform(X_teams).todense()
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
The accuracy is 60%: higher than the baseline, but not as good as before. A likely cause is that the decision tree copes poorly with the larger number of features. Given that, let's try changing the algorithm and see whether that helps. Data mining is often exactly this process of trying out new algorithms and new features.
3.4 Random forests
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Using full team labels is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using full team labels is ranked higher
准确率: 61.5%
X_all = np.hstack([X_home_higher,X_teams])
print(X_all.shape)
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("准确率: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
准确率: 62.9%
We can also try other parameters with the GridSearchCV class:
parameter_space = {
    "max_features": [2, 10, 'auto'],
    "n_estimators": [100,],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("准确率: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_)
Accuracy: 65.4%
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='entropy', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=6, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=14, verbose=0,
warm_start=False)