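Machine learning algorithms, principles and implementation: AdaBoost, or "three cobblers with their wits combined equal Zhuge Liang"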


bonelee 2023-11-15 10:26:25 © Copyright


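© Copyright belongs to the author: this is an original work by bonelee, a 51CTO blog author. Please contact the author for permission before reposting; otherwise legal liability will be pursued.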

What is the basic principle of the AdaBoost algorithm? Can it be illustrated with a simple example?

AdaBoost (Adaptive Boosting) is an ensemble learning method whose basic idea is to combine multiple weak learners into one strong learner. It works as follows:

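Weight initialization: given a training set, assign a weight to every training sample; at the start all weights are equal.
Train a weak learner: in each iteration, train a weak learner on the data set using the current sample weights.
Compute the error rate: predict on the data set with the current weak learner, then compute the weighted error rate.
Compute the learner weight: derive the weak learner's weight from its error rate; a learner with a lower error rate gets a larger weight.
Update the sample weights: increase the weights of misclassified samples and decrease the weights of correctly classified samples.
Iterate: repeat the steps above until the iteration limit is reached or the error rate reaches a preset threshold.
Combine the weak learners: combine all weak learners, weighted by their learner weights, into a strong learner.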
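To make the steps above concrete, here is a minimal sketch of a single boosting round worked by hand. The five labels and the weak learner's predictions are made-up toy values, not data from this article; the formulas are the standard AdaBoost ones for labels in {-1, +1}.

```python
import math

# Toy data (hypothetical): true labels of 5 samples, in {-1, +1}
y = [1, 1, -1, -1, 1]
# Predictions of one hypothetical weak learner; only the last sample is wrong
pred = [1, 1, -1, -1, -1]

m = len(y)
# Weight initialization: every sample starts with weight 1/m
w = [1.0 / m] * m

# Weighted error rate: total weight of the misclassified samples
error = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)

# Learner weight: the lower the error, the larger alpha
alpha = 0.5 * math.log((1.0 - error) / error)

# Sample weight update: y * pred is +1 when correct and -1 when wrong,
# so wrong samples are scaled up by exp(alpha), correct ones down by exp(-alpha)
w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
# Normalize so the weights sum to 1 again
total = sum(w)
w = [wi / total for wi in w]

print("error:", round(error, 4))
print("alpha:", round(alpha, 4))
print("new weights:", [round(wi, 4) for wi in w])
```

With these toy values the error is 0.2, alpha is ln 2 (about 0.6931), and the new weights are 0.125 for each of the four correct samples and 0.5 for the misclassified one. The next weak learner therefore pays four times as much for getting that sample wrong again.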

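In essence, this is the proverb "three cobblers (A, B, C) with their wits combined equal Zhuge Liang":

Suppose A predicts one third of the samples well. The samples A gets wrong then receive larger weights, so B and C concentrate on those remaining samples. AdaBoost thus pools the wisdom of all three, so that each one's strengths come into play.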
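The intuition can be sketched with a small vote. The setup below is a hypothetical toy case, not the article's data: three classifiers A, B and C are each wrong on a different third of six samples. It also simplifies AdaBoost by giving every classifier an equal vote, whereas AdaBoost trains the learners one after another and weights each vote by its alpha.

```python
# Hypothetical toy labels for 6 samples, in {-1, +1}
y = [1, 1, -1, -1, 1, -1]

# Each classifier is wrong on a different third of the samples
pred_a = [-1, -1, -1, -1, 1, -1]  # wrong on samples 0 and 1
pred_b = [1, 1, 1, 1, 1, -1]      # wrong on samples 2 and 3
pred_c = [1, 1, -1, -1, -1, 1]    # wrong on samples 4 and 5


def accuracy(pred, truth):
    """Fraction of samples where the prediction matches the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def majority_vote(*preds):
    """Equal-weight vote: the sign of the sum of the +1/-1 predictions."""
    return [1 if sum(votes) > 0 else -1 for votes in zip(*preds)]


combined = majority_vote(pred_a, pred_b, pred_c)

for name, pred in [("A", pred_a), ("B", pred_b), ("C", pred_c)]:
    print(name, "accuracy:", round(accuracy(pred, y), 4))
print("combined accuracy:", accuracy(combined, y))
```

Each classifier alone is right on 4 of 6 samples, yet the vote is right on all 6, because on every sample the two classifiers that are correct outvote the one that is wrong.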

The AdaBoost formula derivation and computation flow:

[Figure: AdaBoost formula derivation and computation flow (image not available)]
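Since the figure is not reproduced in this text, the block below restates the standard AdaBoost formulas for labels in {-1, +1}. It is a reconstruction from the textbook form of the algorithm, not necessarily the exact content of the original figure. The step labels are chosen to match the comments (1) and (2.b) to (2.e) in the code that follows; N is the number of training samples and M the number of weak classifiers.

```latex
% (1) Initial sample weights
w_{1,i} = \frac{1}{N}, \quad i = 1, \dots, N

% (2.b) Weighted error rate of the m-th weak classifier G_m
e_m = \sum_{i=1}^{N} w_{m,i} \, \mathbb{I}\big(G_m(x_i) \neq y_i\big)

% (2.c) Weight of the m-th weak classifier
\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}

% (2.d) Sample weight update, normalized by Z_m
w_{m+1,i} = \frac{w_{m,i} \exp\big(-\alpha_m y_i G_m(x_i)\big)}{Z_m},
\quad Z_m = \sum_{i=1}^{N} w_{m,i} \exp\big(-\alpha_m y_i G_m(x_i)\big)

% (2.e) Final strong classifier: weighted vote of all weak classifiers
G(x) = \operatorname{sign}\Big( \sum_{m=1}^{M} \alpha_m G_m(x) \Big)
```

The code differs from (2.c) in one small way: it adds 1e-9 to the denominator so that a stump with zero error does not cause a division by zero.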

Code implementation:

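import numpy as np


### Decision stump class,
### used as the AdaBoost weak classifier
class DecisionStump:
    def __init__(self):
        # Polarity: with 1, samples below the threshold are predicted -1;
        # with -1, samples above the threshold are predicted -1
        self.label = 1
        # Index of the feature to split on
        self.feature_index = None
        # Split threshold for that feature
        self.threshold = None
        # Weight of this classifier in the final vote
        self.alpha = None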

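### Adaboost class
class Adaboost:
    # n_estimators: number of weak classifiers
    def __init__(self, n_estimators=5):
        self.n_estimators = n_estimators
    # AdaBoost fitting algorithm
    def fit(self, X, y):
        m, n = X.shape
        # (1) Initialize the sample weights uniformly to 1/m (m samples)
        w = np.full(m, (1/m))
        # Initialize the list of base classifiers
        self.estimators = []
        # (2) Boosting rounds 1, 2, ..., n_estimators
        for _ in range(self.n_estimators):
            # (2.a) Train one weak classifier: a decision stump
            estimator = DecisionStump()
            # Smallest weighted error found so far
            min_error = float('inf')
            # Loop over the features and pick the one with the lowest weighted error
            for i in range(n):
                # Distinct values of feature i, used as candidate thresholds
                unique_values = np.unique(X[:, i])
                # Try each distinct value as the split threshold
                for threshold in unique_values:
                    p = 1
                    # Start with every prediction set to 1
                    pred = np.ones(np.shape(y))
                    # Samples below the threshold are predicted as -1
                    pred[X[:, i] < threshold] = -1
                    # (2.b) Weighted error rate
                    error = sum(w[y != pred])
                    # If the error exceeds 0.5, flip the polarity of the prediction
                    # e.g. error = 0.6 => (1 - error) = 0.4
                    if error > 0.5:
                        error = 1 - error
                        p = -1
                    # Keep the configuration whenever a new minimum error is found
                    if error < min_error:
                        estimator.label = p
                        estimator.threshold = threshold
                        estimator.feature_index = i
                        min_error = error
            # (2.c) Weight of this base classifier (1e-9 avoids division by zero)
            estimator.alpha = 0.5 * np.log((1.0 - min_error) /
                                           (min_error + 1e-9))
            # Start with every prediction set to 1
            preds = np.ones(np.shape(y))
            # Indices predicted negative, taking the polarity into account
            # Note: when label is -1, samples exactly equal to the threshold are
            # predicted +1 here but were counted as -1 in the search above
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label *
                            estimator.threshold)
            # Set those samples to -1
            preds[negative_idx] = -1
            # (2.d) Update the sample weights and renormalize
            w *= np.exp(-estimator.alpha * y * preds)
            w /= np.sum(w)
            # Save this weak classifier
            self.estimators.append(estimator)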

    # Prediction function
    def predict(self, X):
        m = len(X)
        y_pred = np.zeros((m, 1))
        # Accumulate the prediction of every weak classifier
        for estimator in self.estimators:
            # Start with every prediction set to 1
            predictions = np.ones(np.shape(y_pred))
            # Indices predicted negative, taking the polarity into account
            negative_idx = (estimator.label * X[:, estimator.feature_index] < estimator.label *
                            estimator.threshold)
            # Set those samples to -1
            predictions[negative_idx] = -1
            # (2.e) Weight each weak classifier's prediction by its alpha
            y_pred += estimator.alpha * predictions
        # Final prediction: the sign of the weighted sum
        y_pred = np.sign(y_pred).flatten()
        return y_pred
    
# Import the train/test split helper
from sklearn.model_selection import train_test_split
# Import the generator for simulated classification data
from sklearn.datasets import make_blobs
# Import the accuracy metric
from sklearn.metrics import accuracy_score
# Generate a simulated binary classification data set
X, y =  make_blobs(n_samples=150, n_features=2, centers=2, cluster_std=1.2, random_state=40)
# Convert the labels from 0/1 to -1/1
y_ = y.copy()
y_[y_==0] = -1
y_ = y_.astype(float)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_, test_size=0.3, random_state=43)
# Create an Adaboost model instance
clf = Adaboost(n_estimators=5)
# Fit the model
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Compute the classification accuracy of the predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of AdaBoost by numpy:", accuracy)


# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
# Create a model instance
clf_ = AdaBoostClassifier(n_estimators=5, random_state=0)
# Fit the model
clf_.fit(X_train, y_train)
# Predict on the test set
y_pred_ = clf_.predict(X_test)
# Compute the classification accuracy
accuracy = accuracy_score(y_test, y_pred_)
print("Accuracy of AdaBoost by sklearn:", accuracy)

Output:

Accuracy of AdaBoost by numpy: 0.9777777777777777

Accuracy of AdaBoost by sklearn: 0.9777777777777777

Explanation:

[Figure: explanation (image not available)]