AdaBoost
Applicable problem: binary classification
- Model: additive model
- Strategy: exponential loss function
- Algorithm: forward stagewise algorithm

Characteristics: AdaBoost learns one base classifier per iteration. In each round, it increases the weights of the samples misclassified by the previous round's classifier and decreases the weights of the correctly classified ones. Finally, AdaBoost takes a linear combination of the base classifiers as the strong classifier, giving larger weights to base classifiers with small classification error rates and smaller weights to those with large error rates.
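In symbols, the final classifier is the additive model over base classifiers \(G_m(x)\), and forward stagewise fitting minimizes the exponential loss:

\[
f(x)=\sum_{m=1}^{M}\alpha_m G_m(x),\qquad L(y,f(x))=\exp(-y\,f(x))
\]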
Algorithm steps:
1) Assign a weight to each training sample \((x_{1},x_{2},\ldots,x_{N})\); the initial weight vector \(w_1\) has all entries equal to \(1/N\).
2) Train on the weighted samples to obtain model \(G_m\) (the initial model is \(G_1\)).
3) Compute the misclassification rate of \(G_m\): \(e_m=\sum_{i=1}^N w_{mi}\,I(y_i\neq G_m(x_i))\) (the rate should be less than 0.5; otherwise, flipping the predictions yields a classifier whose rate is below 0.5).
4) Compute the coefficient of \(G_m\): \(\alpha_m=0.5\log[(1-e_m)/e_m]\).
5) Update the weight vector \(w_m\) to \(w_{m+1}\) using \(e_m\): \(w_{m+1,i}=\frac{w_{m,i}}{Z_m}\exp(-\alpha_m y_i G_m(x_i))\), where \(Z_m\) is the normalization factor that makes the weights sum to 1.
6) Compute the misclassification rate of the combined model \(f(x)=\sum_{m=1}^M\alpha_mG_m(x)\).
7) Stop when the combined model's misclassification rate falls below a given threshold or the number of iterations reaches its limit; otherwise, go back to step 2).
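To make steps 3) to 5) concrete, here is a minimal sketch of one boosting round on hypothetical toy data (the five labels and predictions below are made up purely for illustration):

```python
import numpy as np

# hypothetical toy labels and one weak classifier's predictions
y     = np.array([ 1,  1, -1, -1,  1])
preds = np.array([ 1, -1, -1, -1, -1])  # G_1 misclassifies samples 2 and 5
w     = np.full(5, 1 / 5)               # step 1): uniform initial weights

e = np.sum(w[preds != y])               # step 3): weighted error = 0.4
alpha = 0.5 * np.log((1 - e) / e)       # step 4): coefficient ≈ 0.203
w = w * np.exp(-alpha * y * preds)      # step 5): up-weight the mistakes,
w /= w.sum()                            #          then normalize by Z
print(e, alpha, w)  # the two misclassified samples now carry weight 0.25 each
```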
Boosting Trees
A boosting tree is a boosting method that uses classification or regression trees as base classifiers. Boosting trees are considered one of the most effective methods in statistical learning.
Boosting methods: promote a weakly learnable algorithm to a strongly learnable one. A boosting method repeatedly modifies the weight distribution over the training data to build a sequence of base classifiers (weak classifiers), then linearly combines them into a strong classifier. AdaBoost is a representative boosting method. The rest of this section implements AdaBoost with decision stumps; a sketch of the regression flavor of boosting trees follows below.
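For the regression case, a boosting tree fits each new tree to the residuals of the current ensemble. A minimal sketch (the helper names, stump depth, and round count are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_tree_fit(X, y, num_rounds=20):
    """Each round fits a small tree to the residuals left by the
    sum of all previous trees, then adds it to the ensemble."""
    trees = []
    residual = y.astype(float).copy()
    for _ in range(num_rounds):
        tree = DecisionTreeRegressor(max_depth=1)  # a stump as the base learner
        tree.fit(X, residual)
        residual -= tree.predict(X)  # what the ensemble still gets wrong
        trees.append(tree)
    return trees

def boosting_tree_predict(trees, X):
    return sum(tree.predict(X) for tree in trees)

# toy usage: learn y = x^2 on a 1-D grid
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = X.ravel() ** 2
trees = boosting_tree_fit(X, y)
print(np.mean((boosting_tree_predict(trees, X) - y) ** 2))  # small training MSE
```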
AdaBoost Implementation from Scratch
Assume each weak classifier is generated by \(x < v\) or \(x > v\), where the threshold \(v\) minimizes that classifier's error rate on the training set.
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

def create_data():
    iris = load_iris()  # the Iris dataset
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    # keep the first 100 samples and only the first two features
    data = np.array(df.iloc[:100, [0, 1, -1]])
    for d in data:
        if d[-1] == 0:
            d[-1] = -1
    return data[:, :2], data[:, -1].astype(int)
```
```python
class AdaBoost:
    def __init__(self, num_classifier, increment=0.5):
        """
        num_classifier: number of weak classifiers
        increment: step size used when scanning a feature for the best
            split point (for sparse data, choose it based on the sample
            values instead)
        """
        self.num_classifier = num_classifier
        self.increment = increment

    def fit(self, X, Y):
        self._init_args(X, Y)
        # train the weak classifiers one by one
        for m in range(self.num_classifier):
            min_error, v_optimal, preds = float('INF'), None, None
            direct_split = None
            feature_idx = None  # column index of the chosen feature
            # pick the feature and split point minimizing the weighted error
            for j in range(self.num_feature):
                feature_values = self.X[:, j]  # all values of the j-th feature
                _ret = self._get_optimal_split(feature_values)
                v_split, _direct_split, error, pred_labels = _ret
                if error < min_error:
                    min_error = error
                    v_optimal = v_split
                    preds = pred_labels
                    direct_split = _direct_split
                    feature_idx = j
            # compute the classifier weight alpha
            alpha = self._cal_alpha(min_error)
            self.alphas.append(alpha)
            # record the current classifier G(x)
            self.classifiers.append((feature_idx, v_optimal, direct_split))
            # update the weight distribution over the training samples
            self._update_weights(alpha, preds)

    def predict(self, x):
        res = 0.0
        for i in range(len(self.classifiers)):
            idx, v, direct = self.classifiers[i]
            # classify x with the i-th weak classifier
            if direct == '>':
                output = 1 if x[idx] > v else -1
            else:  # direct == '<'
                output = -1 if x[idx] > v else 1
            res += self.alphas[i] * output
        return 1 if res > 0 else -1  # sign(res)

    def score(self, X_test, Y_test):
        cnt = 0
        for i, x in enumerate(X_test):
            if self.predict(x) == Y_test[i]:
                cnt += 1
        return cnt / len(X_test)

    def _init_args(self, X, Y):
        self.X = X
        self.Y = Y
        # N: number of samples, num_feature: number of features
        self.N, self.num_feature = X.shape
        # all samples start with the same weight
        self.weights = [1 / self.N] * self.N
        # the set of weak classifiers
        self.classifiers = []
        # the weight alpha of each classifier G(x)
        self.alphas = []

    def _update_weights(self, alpha, pred_labels):
        # compute the normalization factor Z
        Z = self._cal_norm_factor(alpha, pred_labels)
        for i in range(self.N):
            self.weights[i] = (self.weights[i]
                               * np.exp(-1 * alpha * self.Y[i] * pred_labels[i]) / Z)

    def _cal_alpha(self, error):
        return 0.5 * np.log((1 - error) / error)

    def _cal_norm_factor(self, alpha, pred_labels):
        return sum([self.weights[i] * np.exp(-1 * alpha * self.Y[i] * pred_labels[i])
                    for i in range(self.N)])

    def _get_optimal_split(self, feature_values):
        error = float('INF')    # classification error
        pred_labels = []        # predicted labels
        v_split_optimal = None  # best split point for this feature
        direct_split = None     # inequality direction at the best split
        max_v = max(feature_values)
        min_v = min(feature_values)
        num_step = (max_v - min_v + self.increment) / self.increment
        for i in range(int(num_step)):
            # candidate split point
            v_split = min_v + i * self.increment
            judge_direct = '>'
            preds = [1 if feature_values[k] > v_split else -1
                     for k in range(len(feature_values))]
            # weighted error over misclassified samples
            weight_error = sum([self.weights[k] for k in range(self.N)
                                if preds[k] != self.Y[k]])
            # error with the predicted labels flipped
            preds_inv = [-p for p in preds]
            weight_error_inv = sum([self.weights[k] for k in range(self.N)
                                    if preds_inv[k] != self.Y[k]])
            # keep whichever direction gives the smaller error
            if weight_error_inv < weight_error:
                preds = preds_inv
                weight_error = weight_error_inv
                judge_direct = '<'
            if weight_error < error:
                error = weight_error
                pred_labels = preds
                v_split_optimal = v_split
                direct_split = judge_direct
        return v_split_optimal, direct_split, error, pred_labels
```
Test the model's accuracy:
```python
X, Y = create_data()

res = []
for i in range(10):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    clf = AdaBoost(num_classifier=50)
    clf.fit(X_train, Y_train)
    res.append(clf.score(X_test, Y_test))
print('My AdaBoost: average accuracy over {} runs: {:.3f}'.format(
    len(res), sum(res) / len(res)))
```
My AdaBoost: average accuracy over 10 runs: 0.970
AdaBoost with the sklearn Library
```python
from sklearn.ensemble import AdaBoostClassifier

res = []
for i in range(10):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    clf_sklearn = AdaBoostClassifier(n_estimators=50, learning_rate=0.5)
    clf_sklearn.fit(X_train, Y_train)
    res.append(clf_sklearn.score(X_test, Y_test))
print('sklearn AdaBoostClassifier: average accuracy over {} runs: {:.3f}'.format(
    len(res), sum(res) / len(res)))
```
sklearn AdaBoostClassifier: average accuracy over 10 runs: 0.945