Feature Selection with Random Forest Regression

Reposted

mob64ca140fd7c1 2024-07-21 19:42:14


Contents

Overview
Overall Workflow
Terminology
Technical Details
Summary

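Overview

In this project we look at how to combine a random forest regression model with recursive feature elimination with cross-validation (RFECV) to select features and predict the target variable in a dataset. The process covers data preprocessing, model training, feature importance evaluation, and visualization of the final results.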

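Overall Workflow

The data processing and analysis workflow has four main parts:

Data cleaning: handle missing values and non-numeric error markers (such as '#DIV/0!').
Feature selection: use RFECV to select the most important features.
Model training: fit a random forest regressor to the data.
Result visualization: plot model performance against the number of features.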
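
The data-cleaning step listed above can be tried in isolation. The following is a minimal sketch on a made-up two-column frame (the column names `feature_a` and `feature_b` are hypothetical, not from the real dataset); it uses the same replace, cast, and mean-fill sequence as the full script.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): spreadsheet exports can leave '#DIV/0!' strings behind.
raw = pd.DataFrame({
    "feature_a": ["1.0", "#DIV/0!", "3.0"],
    "feature_b": [10.0, 20.0, np.nan],
})

# Turn the error marker into NaN, cast every column to float,
# then fill each column's gaps with that column's mean.
cleaned = raw.replace("#DIV/0!", np.nan).astype(float)
cleaned = cleaned.fillna(cleaned.mean())

print(cleaned)
```

Note that `astype(float)` raises an error if any other non-numeric text remains in the data, so this approach assumes '#DIV/0!' is the only such marker.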

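Terminology

Random Forest: an ensemble learning method built from many decision trees, used for classification and regression.
Recursive Feature Elimination with Cross-Validation (RFECV): a feature selection method that repeatedly drops the least important features and uses cross-validation to choose how many to keep.
KFold: a cross-validation scheme that splits the dataset into K subsets (folds); each fold serves once as the validation set while the remaining K-1 folds are used for training.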
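
To make the KFold definition concrete, here is a small sketch that prints the train and validation indices for each of 5 folds. The 10-sample array is made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

kf = KFold(n_splits=5)  # no shuffling by default, so folds are contiguous blocks
folds = list(kf.split(X))
for fold_id, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_id}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Because KFold does not shuffle by default, data stored in a meaningful order (for example by time) may yield unrepresentative folds; `KFold(5, shuffle=True, random_state=0)` is one way to randomize the split reproducibly.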

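Technical Details

The code uses RandomForestRegressor as the base estimator and performs feature selection with RFECV. KFold(5) defines the 5-fold cross-validation procedure, and scoring='neg_mean_squared_error' means that higher (less negative) scores are better.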
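
Since the real CSV file is not available to readers, the following sketch runs the same RFECV configuration on synthetic data to show what the fitted selector exposes. The dataset from `make_regression` and the smaller `n_estimators=20` are assumptions chosen only to keep the example fast; they are not part of the original project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data: 6 features, only 2 carry signal.
X, y = make_regression(n_samples=80, n_features=6, n_informative=2,
                       noise=1.0, random_state=0)

selector = RFECV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    step=1,                              # drop one feature per elimination round
    cv=KFold(5),                         # 5-fold cross-validation
    scoring="neg_mean_squared_error",    # higher (closer to 0) is better
    min_features_to_select=1,
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("support mask: ", selector.support_)   # True for each kept feature
print("ranking:      ", selector.ranking_)   # 1 for kept features, higher = dropped earlier
X_selected = selector.transform(X)            # keep only the selected columns
print("reduced shape:", X_selected.shape)
```

Every selected feature receives rank 1, so `ranking_` alone does not order the features inside the selected set; the importances of the refitted `estimator_` are needed for that.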

# -*- coding: utf-8 -*-
'''
 @project: pythonProject
 @Author：大营
 @file： code.py
 @date：2024/3/27 19:43
 @WeChat: dD-Q1595031248
 '''



import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
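# Use a CJK-capable font (SimHei) so Chinese text in plot labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign displayable when a CJK font is active
plt.rcParams['axes.unicode_minus'] = False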



data_path = '特征变量1.csv'
data = pd.read_csv(data_path)

# Replace '#DIV/0!' error markers with NaN, cast to float, then fill missing values with column means
data_cleaned = data.replace('#DIV/0!', np.nan).astype(float)
data_cleaned = data_cleaned.fillna(data_cleaned.mean())

# Split into features (X) and target (y)
X_clean = data_cleaned.drop('incident_test', axis=1)
y_clean = data_cleaned['incident_test']

# Initialize the random forest regressor
rf_regressor = RandomForestRegressor(random_state=42)

# Recursive feature elimination with 5-fold cross-validation (KFold)
rfecv_regressor = RFECV(estimator=rf_regressor, step=1, cv=KFold(5), scoring='neg_mean_squared_error', min_features_to_select=1)
rfecv_regressor.fit(X_clean, y_clean)

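# Importances of the selected features only (estimator_ is refit on the reduced feature set), sorted in descending order
feature_importances_corrected = rfecv_regressor.estimator_.feature_importances_
sorted_idx_corrected = np.argsort(feature_importances_corrected)[::-1]

# Bar chart of the sorted feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importances_corrected)), feature_importances_corrected[sorted_idx_corrected])
plt.xlabel('Features (sorted by importance)')
plt.ylabel('Feature importance')
plt.title('Feature importance (selected features)')
plt.show()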

# Cumulative importance of the sorted features, with a 95% reference line
cumulative_importances_corrected = np.cumsum(feature_importances_corrected[sorted_idx_corrected])
n_ranked = len(feature_importances_corrected)
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_ranked + 1), cumulative_importances_corrected, 'b-')  # x axis starts at 1 feature
plt.xlabel('Number of features (sorted by importance)')
plt.ylabel('Cumulative importance')
plt.title('Cumulative feature importance')
plt.hlines(y=0.95, xmin=1, xmax=n_ranked, color='r', linestyles='dashed')
plt.show()




# rfecv_regressor was already fitted above with the same settings and fixed random_state,
# so its cross-validation results are reused here instead of refitting

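# With scoring='neg_mean_squared_error' the stored scores are negated MSE values; flip the sign to plot MSE itself
cv_scores = -rfecv_regressor.cv_results_['mean_test_score']

# Bar chart: cross-validated MSE for each number of features (lower is better)
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(cv_scores) + 1), cv_scores)
plt.xlabel('Number of features')
plt.ylabel('Cross-validated mean squared error')
plt.title('Number of features vs. cross-validated error')
plt.xticks(range(1, len(cv_scores) + 1))  # one tick per bar
plt.tight_layout()  # keep labels and title inside the figure
plt.show()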
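# List the feature subsets that can be recovered from the ranking.
# ranking_ is 1 for every selected feature; with step=1, each higher rank adds back
# the feature eliminated one round earlier, so each subset is keyed by its actual size.
features_per_count = {}
for rank in range(1, rfecv_regressor.ranking_.max() + 1):
    subset = list(X_clean.columns[rfecv_regressor.ranking_ <= rank])
    features_per_count[len(subset)] = subset

print(features_per_count)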
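
The 0.95 reference line in the cumulative importance plot can also be read off numerically. The sketch below uses a made-up importance vector (hypothetical values, not results from the real dataset) and finds the smallest number of top-ranked features whose importances sum to at least 95%.

```python
import numpy as np

# Hypothetical importances for five features (they sum to 1, as in scikit-learn forests).
importances = np.array([0.05, 0.40, 0.10, 0.02, 0.43])

order = np.argsort(importances)[::-1]        # indices, most important first
cumulative = np.cumsum(importances[order])   # running total of importance

# Index of the first running total that reaches 0.95, converted to a count.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print("features needed for 95%:", n_needed)
print("their indices:", order[:n_needed].tolist())
```

In the full script, the same idea would apply to `feature_importances_corrected` and `sorted_idx_corrected`, keeping in mind that those importances cover only the features RFECV selected.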