Table of Contents

  • Mind map
  • 3.1.2 Loading the dataset with pandas
  • 3.1.3 Cleaning the dataset
  • Computing the actual values for these features
  • Did the home and visitor teams win their last game?
  • 3.2 Decision trees
  • 3.2.1 Parameters in decision trees
  • 3.2.2 Using decision trees
  • 3.3 Predicting sports match outcomes
  • 3.4 Random forests

Mind map

(Mind-map image: machine learning algorithms in Python, decision trees)

import os
import numpy as np
import pandas as pd

# Paths to the 2013-14 NBA season game log
home_folder = "./PythonDataMining/"
data_folder = os.path.join(home_folder, "data")
data_filename = os.path.join(data_folder, "leagues_NBA_2014_games_games.csv")

3.1.2 Loading the dataset with pandas

results = pd.read_csv(data_filename)
results.iloc[:5]



|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|
| 0 | Tue Oct 29 2013 | Box Score | Orlando Magic | 87 | Indiana Pacers | 97 | NaN | NaN |
| 1 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN |
| 2 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN |
| 3 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN |
| 4 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN |

3.1.3 Cleaning the dataset

results = pd.read_csv(data_filename, skiprows=[0,])
# Fix the name of the columns
results.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]
results.iloc[:5]



|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN |

# Our class label: True when the home team outscores the visitors
results['HomeWin'] = results['VisitorPts'] < results['HomePts']
y_true = results['HomeWin'].values
results.iloc[:5]



|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True |

print("Home Win 百分比: {0:.1f}%".format(100 * results["HomeWin"].sum() / results["HomeWin"].count()))
results["HomeLastWin"] = False
results["VisitorLastWin"] = False
# This creates two new columns, all set to False
results.iloc[:5]
Home Win 百分比: 58.0%



|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | False | False |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | False | False |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | False | False |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | False | False |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | False | False |

Computing the actual values for these features

Did the home and visitor teams win their last game?

# Now compute the actual values for these
# Did the home and visitor teams win their last game?
from collections import defaultdict
won_last = defaultdict(int)

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.iloc[index] = row
    # Set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results.iloc[20:25]



|    | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin |
|----|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|
| 20 | Fri Nov 1 2013 | Box Score | Miami Heat | 100 | Brooklyn Nets | 101 | NaN | NaN | True | False | False |
| 21 | Fri Nov 1 2013 | Box Score | Cleveland Cavaliers | 84 | Charlotte Bobcats | 90 | NaN | NaN | True | False | True |
| 22 | Fri Nov 1 2013 | Box Score | Portland Trail Blazers | 113 | Denver Nuggets | 98 | NaN | NaN | False | False | False |
| 23 | Fri Nov 1 2013 | Box Score | Dallas Mavericks | 105 | Houston Rockets | 113 | NaN | NaN | True | True | True |
| 24 | Fri Nov 1 2013 | Box Score | San Antonio Spurs | 91 | Los Angeles Lakers | 85 | NaN | NaN | False | False | True |
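
The iterrows loops in this chapter work, but, as their comments note, they are slow. As a hedged aside, the same "did each team win its previous game?" feature can be computed in a vectorized way with groupby and shift; the long_form, side, and won_last names below are invented for this sketch:

# One row per (game, team) appearance, tagged with the side the team played on.
home = pd.DataFrame({"game": results.index, "team": results["Home Team"],
                     "won": results["HomeWin"], "side": "home"})
away = pd.DataFrame({"game": results.index, "team": results["Visitor Team"],
                     "won": ~results["HomeWin"], "side": "visitor"})
long_form = pd.concat([home, away]).sort_values("game", kind="stable")

# For each team, its result in its previous appearance
# (NaN, filled as False, for a team's first game of the season).
long_form["won_last"] = long_form.groupby("team")["won"].shift(1).fillna(False)

# Map the per-appearance values back onto the original wide rows.
results["HomeLastWin"] = long_form.loc[long_form["side"] == "home", "won_last"].values
results["VisitorLastWin"] = long_form.loc[long_form["side"] == "visitor", "won_last"].values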

3.2 Decision trees

A decision tree is a supervised machine learning algorithm. It looks like a flowchart made up of a series of nodes, where the value at an upper node determines which node to move to next.

%%html
<img src="./image/决策树1.png" width="100" height="100">

Like most classification algorithms, decision trees involve two major steps.

First is the training phase, in which a tree is built from training data. The nearest neighbor algorithm from the previous chapter had no training phase; in that sense it is a lazy algorithm that only does its work when asked to classify. Decision trees, like most machine learning methods, are eager learners: the model is created during the training phase.

Second is the prediction phase, in which the trained tree is used to predict the class of new data. Using the figure above as an example, ["is raining", "very windy"] is predicted as "Bad" (bad weather).

There are several algorithms for building decision trees, most of which grow the tree iteratively. They start at the root node, pick the best feature for the first decision, move to the next node, pick the next best feature, and so on. The algorithm stops when it determines that adding further levels to the tree yields no more information.

The scikit-learn library implements the Classification and Regression Trees (CART) algorithm as its default way of building decision trees; it supports both continuous and categorical features.
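
As a minimal sketch of these two phases, here is a toy run with DecisionTreeClassifier; the tiny weather dataset below is invented purely for illustration:

from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: features are (is raining, very windy),
# labels are "Good"/"Bad" weather.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = ["Good", "Good", "Bad", "Bad"]

# Training phase: build the tree from the training data.
clf = DecisionTreeClassifier(random_state=14)
clf.fit(X_train, y_train)

# Prediction phase: classify a new sample, here ["is raining", "very windy"].
print(clf.predict([[1, 1]]))  # -> ['Bad']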

3.2.1 Parameters in decision trees
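
The most important parameters control when the tree stops growing, which is the main lever against overfitting. As a hedged sketch (the values below are scikit-learn's defaults, shown as examples rather than recommendations), the main knobs DecisionTreeClassifier exposes are:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="gini",      # impurity measure used to pick splits: "gini" or "entropy"
    max_depth=None,        # maximum tree depth; None keeps splitting until leaves are pure
    min_samples_split=2,   # a node needs at least this many samples before it can be split
    min_samples_leaf=1,    # every leaf must keep at least this many samples
    random_state=14,       # fix the seed so the tree is reproducible
)

Lower max_depth or higher min_samples_leaf values produce smaller, more conservative trees.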

3.2.2 Using decision trees

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)

from sklearn.model_selection import cross_val_score

X_previouswins = results[['HomeLastWin', 'VisitorLastWin']].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print('Using just the last result from the home and visitor teams')
print('Accuracy: {0:.1f}%'.format(np.mean(scores) * 100))
Using just the last result from the home and visitor teams
Accuracy: 59.1%

3.3 Predicting sports match outcomes

# What about win streaks?
results["HomeWinStreak"] = 0
results["VisitorWinStreak"] = 0
# How long is each team's current winning streak?
from collections import defaultdict
win_streak = defaultdict(int)

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeWinStreak"] = win_streak[home_team]
    row["VisitorWinStreak"] = win_streak[visitor_team]
    results.loc[index] = row
    # Set current win
    if row["HomeWin"]:
        win_streak[home_team] += 1
        win_streak[visitor_team] = 0
    else:
        win_streak[home_team] = 0
        win_streak[visitor_team] += 1

clf = DecisionTreeClassifier(random_state=14)
X_winstreak = results[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(clf, X_winstreak, y_true, scoring='accuracy')
print("Using the home and visitor teams' win streaks")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using the home and visitor teams' win streaks
Accuracy: 58.4%

Now let's try using which team is placed higher on the ladder, taking the standings from the previous season.

ladder_filename = os.path.join(data_folder, "leagues_NBA_2013_standings_expanded-standings.csv")
ladder = pd.read_csv(ladder_filename)
ladder.head()



|   | Rk | Team | Overall | Home | Road | E | W | A | C | SE | ... | Post | ≤3 | ≥10 | Oct | Nov | Dec | Jan | Feb | Mar | Apr |
|---|----|------|---------|------|------|---|---|---|---|----|-----|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | Miami Heat | 66-16 | 37-4 | 29-12 | 41-11 | 25-5 | 14-4 | 12-6 | 15-1 | ... | 30-2 | 9-3 | 39-8 | 1-0 | 10-3 | 10-5 | 8-5 | 12-1 | 17-1 | 8-1 |
| 1 | 2 | Oklahoma City Thunder | 60-22 | 34-7 | 26-15 | 21-9 | 39-13 | 7-3 | 8-2 | 6-4 | ... | 21-8 | 3-6 | 44-6 | NaN | 13-4 | 11-2 | 11-5 | 7-4 | 12-5 | 6-2 |
| 2 | 3 | San Antonio Spurs | 58-24 | 35-6 | 23-18 | 25-5 | 33-19 | 8-2 | 9-1 | 8-2 | ... | 16-12 | 9-5 | 31-10 | 1-0 | 12-4 | 12-4 | 12-3 | 8-3 | 10-4 | 3-6 |
| 3 | 4 | Denver Nuggets | 57-25 | 38-3 | 19-22 | 19-11 | 38-14 | 5-5 | 10-0 | 4-6 | ... | 24-4 | 11-7 | 28-8 | 0-1 | 8-8 | 9-6 | 12-3 | 8-4 | 13-2 | 7-1 |
| 4 | 5 | Los Angeles Clippers | 56-26 | 32-9 | 24-17 | 21-9 | 35-17 | 7-3 | 8-2 | 6-4 | ... | 17-9 | 3-5 | 38-12 | 1-0 | 8-6 | 16-0 | 9-7 | 8-5 | 7-7 | 7-1 |

5 rows × 24 columns

# It seems every feature here gets reduced to just a few categories (e.g. True and False);
# otherwise computing the information gain would be very expensive

# We can create a new feature -- HomeTeamRanksHigher
results["HomeTeamRanksHigher"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = ladder[ladder["Team"] == home_team]["Rk"].values[0]
    visitor_rank = ladder[ladder["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    results.iloc[index] = row
results[:5]



|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|---------------|------------------|---------------------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 |

X_homehigher = results[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
Accuracy: 60.2%

from sklearn.model_selection import GridSearchCV

parameter_space = {
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
}
clf = DecisionTreeClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_homehigher, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
Accuracy: 60.5%
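
GridSearchCV also records which parameter setting won; best_params_ is a standard attribute, so we can check which depth the search preferred:

print(grid.best_params_)  # e.g. {'max_depth': ...} (the value depends on the run)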

# Who won the last time these two teams met? Here we ignore which side was home and which was visitor

last_match_winner = defaultdict(int)
results['HomeTeamWonLast'] = 0

for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team]))  # Sort for a consistent ordering
    # Record in the current row who won the last meeting between these teams
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    results.loc[index] = row
    # The winner of this match
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner
results.loc[:5]



|   | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher | HomeTeamWonLast |
|---|------|------------|--------------|------------|-----------|---------|-----|-------|---------|-------------|----------------|---------------|------------------|---------------------|-----------------|
| 0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Wed Oct 30 2013 | Box Score | Los Angeles Lakers | 94 | Golden State Warriors | 125 | NaN | NaN | True | 0 | True | 0 | 1 | 0 | 0 |

X_home_higher = results[["HomeTeamRanksHigher", "HomeTeamWonLast"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_home_higher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher and won the last encounter")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher and won the last encounter
Accuracy: 60.5%

Finally, let's see whether a decision tree can produce an effective classification model when given a large amount of training data. We will add the teams themselves as features, to test whether the tree can integrate the new information.

Although decision trees can handle categorical feature values, the implementation in scikit-learn requires such features to be encoded first. The LabelEncoder transformer converts string team names into integers. The code follows:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
encoding = LabelEncoder()
encoding.fit(results["Home Team"].values)
home_teams = encoding.transform(results["Home Team"].values)
visitor_teams = encoding.transform(results["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T  # one (home, visitor) pair per game

A decision tree can be trained on these features, but DecisionTreeClassifier still treats them as continuous. For example, with 17 teams numbered 0 to 16, the algorithm would consider teams 1 and 2 similar and teams 4 and 10 different. That makes no sense: two teams are either the same team or they are not; there is no in-between!

To remove this mismatch with reality, we can use the OneHotEncoder transformer to turn these integers into binary indicator features, one binary column per possible value. For example, if LabelEncoder assigns the Chicago Bulls the value 7, then OneHotEncoder sets the seventh binary feature to 1 for the Bulls and to 0 for every other team. Every possible feature value is handled this way, which makes the dataset much wider. The code follows:

onehot = OneHotEncoder()
X_teams = onehot.fit_transform(X_teams).todense()

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Accuracy: 60.1%

The accuracy is 60%: higher than the baseline, but not as good as before. One possible reason is that the decision tree copes poorly with the larger number of features. Given that, let's try changing the algorithm and see whether it helps. Data mining is often exactly this process of trying new algorithms and new features.

3.4 Random forests
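
A random forest trains many decision trees, each on a bootstrap sample of the games and considering only a random subset of the features at each split, then combines their votes into a single prediction; this usually reduces the variance of any single tree.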

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Using the full team labels")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using the full team labels
Accuracy: 61.5%

X_all = np.hstack([X_home_higher, X_teams])
print(X_all.shape)
(1229, 62)

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Using all features")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using all features
Accuracy: 62.9%

We can also try other parameters with the GridSearchCV class:

parameter_space = {
    "max_features": [2, 10, 'auto'],
    "n_estimators": [100,],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_)
Accuracy: 65.4%
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=6, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=14, verbose=0,
                       warm_start=False)
