A decision tree is a widely used machine learning algorithm for both classification and regression.
The algorithm splits the dataset into subsets according to the values of a feature; repeating this process on each subset yields a tree structure in which every internal node tests one feature and every leaf node holds a classification or regression result.
The main advantages of decision trees are that they are easy to understand and interpret, handle both discrete and continuous data, require little data preprocessing, and automatically select the most informative features during training.
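To make this structure concrete, here is a minimal sketch of how such a tree can be represented as nested dictionaries, the same convention the full implementation below uses. The feature index and threshold shown are illustrative placeholders, not learned values:

```python
# A hypothetical depth-1 tree: the internal node tests one feature against
# a threshold, and each leaf stores a predicted class. Values are made up.
tree = {
    'leaf': False,
    'feature': 2,       # test the third feature...
    'threshold': 2.45,  # ...go left if x[2] <= 2.45, right otherwise
    'left':  {'leaf': True, 'value': 0},   # predict class 0
    'right': {'leaf': True, 'value': 1},   # predict class 1
}
```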
The basic procedure of the decision tree algorithm is as follows:
1. Choose a feature for the current node and partition the dataset into subsets by that feature's values (how the feature is chosen is sketched right after this list);
2. Repeat step 1 on each subset until all samples at a leaf belong to the same class, or a preset limit on tree depth or sample count is reached;
3. To classify or regress a new sample, follow the tree's branches level by level until a leaf node is reached.
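Step 1 leaves open how the split feature is chosen. A common criterion, and the one used in the code below, is Gini impurity: a node whose class proportions are p_k has impurity 1 - sum(p_k^2), which is 0 for a pure node and grows as the classes mix. Each candidate split is scored by the sample-weighted average impurity of its two children, and the split with the lowest score is kept. A minimal standalone sketch with made-up labels:

```python
import torch

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = torch.unique(y, return_counts=True)
    p = counts.float() / len(y)
    return 1.0 - torch.sum(p ** 2)

print(gini(torch.tensor([0, 0, 0, 0])))  # pure node   -> tensor(0.)
print(gini(torch.tensor([0, 0, 1, 1])))  # 50/50 split -> tensor(0.5000)
```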
Below is sample code implementing a decision tree with PyTorch, using the Iris dataset as an example. Note that a decision tree is not trained by gradient descent; PyTorch is used here only for its tensor operations:
```python
import torch
from torch import nn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset, split it, and convert everything to tensors
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)

# Define the decision tree model
class DecisionTree(nn.Module):
    def __init__(self, X, y, depth, num_classes):
        super().__init__()
        self.depth = depth
        self.num_classes = num_classes
        self.feature_indices = torch.arange(X.size(1))
        self.root = self.build_tree(X, y, 0)

    def build_tree(self, X, y, depth):
        # If all samples share one class or max depth is reached, return a leaf
        if len(torch.unique(y)) == 1 or depth == self.depth:
            leaf_value = torch.bincount(y, minlength=self.num_classes).argmax()
            return {'leaf': True, 'value': leaf_value}
        # Otherwise pick the best feature and threshold to split on
        best_feature, best_threshold = self.get_best_split(X, y)
        # If no valid split exists (e.g. all feature values identical), return a leaf
        if best_feature is None:
            leaf_value = torch.bincount(y, minlength=self.num_classes).argmax()
            return {'leaf': True, 'value': leaf_value}
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = X[:, best_feature] > best_threshold
        # Recursively build the left and right subtrees
        left_subtree = self.build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = self.build_tree(X[right_mask], y[right_mask], depth + 1)
        return {'leaf': False, 'feature': best_feature, 'threshold': best_threshold,
                'left': left_subtree, 'right': right_subtree}

    def get_best_split(self, X, y):
        # Lower weighted Gini impurity is better, so minimize the score
        best_feature, best_threshold, best_score = None, None, float('inf')
        for feature in self.feature_indices:
            thresholds = torch.unique(X[:, feature])
            for threshold in thresholds:
                left_mask = X[:, feature] <= threshold
                right_mask = X[:, feature] > threshold
                # Skip splits that leave one side empty
                if left_mask.sum() == 0 or right_mask.sum() == 0:
                    continue
                score = self.gini_impurity(y, left_mask, right_mask)
                if score < best_score:
                    best_feature, best_threshold, best_score = feature, threshold, score
        return best_feature, best_threshold

    def gini_impurity(self, y, left_mask, right_mask):
        # Sample-weighted average of the children's Gini impurities
        left_gini = self.calculate_gini(y[left_mask])
        right_gini = self.calculate_gini(y[right_mask])
        left_weight = left_mask.sum() / len(y)
        right_weight = right_mask.sum() / len(y)
        return left_weight * left_gini + right_weight * right_gini

    def calculate_gini(self, y):
        # Gini impurity of one node: 1 - sum of squared class proportions
        _, counts = torch.unique(y, return_counts=True)
        proportions = counts.float() / len(y)
        return 1.0 - torch.sum(proportions ** 2)

    def predict(self, X):
        # Classify each row of X by walking it down the tree
        return torch.stack([self.traverse_tree(x, self.root) for x in X])

    def traverse_tree(self, x, node):
        if node['leaf']:
            return node['value']
        if x[node['feature']] <= node['threshold']:
            return self.traverse_tree(x, node['left'])
        return self.traverse_tree(x, node['right'])
```
Next, instantiate the model (the tree is built in the constructor) and evaluate it on the test set:

```python
# Build the tree on the training data and predict the whole test set at once
model = DecisionTree(X_train, y_train, depth=2, num_classes=3)
y_pred = model.predict(X_test)
accuracy = (y_pred == y_test).float().mean()
print(f'Test accuracy: {accuracy.item():.3f}')
```
This creates a decision tree with a maximum depth of 2 and 3 classes, predicts a label for every sample in the test set (predict walks each sample down the tree internally), stores the results in y_pred, and finally computes and prints the test accuracy.
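As a sanity check, the result can be compared against scikit-learn's built-in DecisionTreeClassifier on the same split. This is just a sketch; the exact accuracy may differ slightly because of tie-breaking among equally good splits:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train.numpy(), y_train.numpy())
print('sklearn accuracy:', accuracy_score(y_test.numpy(), clf.predict(X_test.numpy())))
```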