A 20,000-Word Guide: Data Preparation for Gradient Boosting with XGBoost in Python (Worth Bookmarking)


Author: 公众号_python风控模型 | Published: 2021-09-16 13:45:02 | Blog category: python bioinformatics


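© Copyright belongs to the author: this is an original work by 51CTO blog author 公众号_python风控模型. Please contact the author for permission before reposting; unauthorized reposting may result in legal action.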

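XGBoost is a popular implementation of gradient boosting because of its speed and performance.

Internally, XGBoost models represent every problem as a regression predictive modeling problem that takes only numerical values as input. If your data is in a different form, it must be prepared into the expected format.

This post explains how to prepare data for gradient boosting with the XGBoost library in Python.

After reading this post you will know:

How to encode string output variables for classification.
How to prepare categorical input variables using one hot encoding.
How to automatically handle missing data with XGBoost.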


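Label Encode String Class Values

The iris flowers classification problem is an example of a problem that has a string class value.

This is a prediction problem where, given measurements of iris flowers in centimeters, the task is to predict which species a given flower belongs to.

Download the dataset and place it in your current working directory with the file name "iris.csv".

Iris dataset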

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

Iris dataset description

1. Title: Iris Plants Database
   Updated Sept 21 by C.Blake - Added discrepency information
2. Sources:
   (a) Creator: R.A. Fisher
   (b) Donor: Michael Marshall (MARSHALL%PLU@)
   (c) Date: July, 1988
3. Past Usage:
   - Publications: too many to mention!!!  Here are a few.
   1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
      to Mathematical Statistics" (John Wiley, NY, 1950).
   2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments".  IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
      -- Results:
         -- very low misclassification rates (0% for the setosa class)
   4. Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE
      Transactions on Information Theory, May 1972, 431-433.
      -- Results:
         -- very low misclassification rates again
   5. See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al's AUTOCLASS II
      conceptual clustering system finds 3 classes in the data.
4. Relevant Information:
   --- This is perhaps the best known database to be found in the pattern
       recognition literature.  Fisher's paper is a classic in the field
       and is referenced frequently to this day.  (See Duda & Hart, for
       example.)  The data set contains 3 classes of 50 instances each,
       where each class refers to a type of iris plant.  One class is
       linearly separable from the other 2; the latter are NOT linearly
       separable from each other.
   --- Predicted attribute: class of iris plant.
   --- This is an exceedingly simple domain.
   --- This data differs from the data presented in Fishers article
       (identified by Steve Chadwick,  spchadwick@ )
       The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa"
       where the error is in the fourth feature.
       The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa"
       where the errors are in the second and third features.
5. Number of Instances: 150 (50 in each of three classes)
6. Number of Attributes: 4 numeric, predictive attributes and the class
7. Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class:
      -- Iris Setosa
      -- Iris Versicolour
      -- Iris Virginica
8. Missing Attribute Values: None
Summary Statistics:
                 Min  Max   Mean    SD   Class Correlation
   sepal length: 4.3  7.9   5.84  0.83    0.7826
    sepal width: 2.0  4.4   3.05  0.43   -0.4194
   petal length: 1.0  6.9   3.76  1.76    0.9490  (high!)
    petal width: 0.1  2.5   1.20  0.76    0.9565  (high!)
9. Class Distribution: 33.3% for each of 3 classes.

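XGBoost cannot model this problem as-is because it requires the output variable to be numeric.

We can easily convert the string values to integer values using the LabelEncoder. The three class values (Iris-setosa, Iris-versicolor, Iris-virginica) are mapped to the integer values (0, 1, 2).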

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

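We save the label encoder as a separate object so that we can transform the training dataset, and later the test and validation datasets, using the same encoding scheme.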
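As a minimal sketch of why the fitted encoder is worth keeping, the snippet below fits a LabelEncoder once and reuses it on a second set of labels. The two short label lists are made up for illustration and stand in for the class column of iris.csv.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the class column of iris.csv (illustrative only).
train_labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
test_labels = ["Iris-virginica", "Iris-setosa"]

# Fit once on the training labels; the classes are stored in sorted order.
label_encoder = LabelEncoder().fit(train_labels)

# Reuse the same fitted encoder for every split so the mapping stays consistent.
train_encoded = label_encoder.transform(train_labels)
test_encoded = label_encoder.transform(test_labels)

# inverse_transform maps integer predictions back to the original strings.
decoded = label_encoder.inverse_transform(test_encoded)

print(list(label_encoder.classes_))
print(train_encoded.tolist(), test_encoded.tolist())
print(list(decoded))
```

Because the mapping lives in the fitted object, integer predictions from the model can be turned back into species names with inverse_transform.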

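Below is a complete example demonstrating how to load the iris dataset and fit a model on it. Note that Pandas is used to load the data so that the string class values can be handled.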

# multiclass classification
import pandas
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# load data
data = pandas.read_csv('iris.csv', header=None)
dataset = data.values
# split data into X and y (cast the measurements from object to float)
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

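Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.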
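One way to act on this advice is to repeat the train/test split with several seeds and report the mean and spread of the accuracy. The sketch below is an illustration under stated assumptions: it uses scikit-learn's bundled copy of the iris data and GradientBoostingClassifier as a stand-in for iris.csv and XGBClassifier, so that it runs without extra files or packages. The loop itself would look the same with XGBClassifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-ins (assumptions): bundled iris data instead of iris.csv, and
# scikit-learn's gradient boosting instead of XGBClassifier.
X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # A different seed gives a different train/test split each time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = GradientBoostingClassifier(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

mean_accuracy = float(np.mean(scores))
std_accuracy = float(np.std(scores))
print("Accuracy: %.2f%% (+/- %.2f%%)" % (mean_accuracy * 100.0, std_accuracy * 100.0))
```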

Running the complete iris listing produces the following output.

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 92.00%

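Notice how the XGBoost model is configured to automatically model the multiclass classification problem using the multi:softprob objective, a variation on the softmax loss function that models class probabilities. This suggests that internally, the output class is converted into a one hot type encoding automatically.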
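To make the softmax idea concrete, the sketch below converts raw per-class scores into class probabilities and picks the most probable class. The scores are hypothetical numbers chosen for illustration, and the function is a plain NumPy softmax, not XGBoost's internal code.

```python
import numpy as np

def softmax(scores):
    """Turn each row of raw per-class scores into probabilities that sum to 1."""
    # Subtracting the row maximum keeps the exponentials numerically stable.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical raw scores for two flowers over the three iris classes.
raw_scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

probabilities = softmax(raw_scores)
predicted_class = probabilities.argmax(axis=1)

print(probabilities.round(3))
print(predicted_class)
```

Each row of probabilities lines up with the integer labels produced by the label encoder, so the index of the largest entry is the predicted class.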

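One Hot Encode Categorical Data

Some datasets contain only categorical data, for example the breast cancer dataset.

This dataset describes the technical details of breast cancer biopsies, and the prediction task is to predict whether or not the patient has a recurrence of cancer.

Download the dataset and place it in your current working directory with the file name "breast-cancer.csv".

Breast Cancer dataset
Breast Cancer dataset description

Below is a sample of the raw dataset.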

'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...

Related course: Python machine learning, breast cancer cell mining
https://edu.51cto.com/sd/7036f


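We can see that all 9 input variables are categorical and described in string format. The problem is a binary classification prediction problem, and the output class values are also described in string format.

We can reuse the same approach from the previous section and convert the string class values to integer values using the LabelEncoder so that the prediction can be modeled. For example: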

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

We can use this same approach on each input feature in X, but this is only a starting point.

# encode string input values as integers
features = []
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    features.append(feature)
# stack the encoded columns side by side: shape (n_samples, n_features)
# (a plain reshape of numpy.array(features) would scramble rows and columns)
encoded_x = numpy.column_stack(features)

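XGBoost may assume that the encoded integer values of each input variable have an ordinal relationship. For example, for the breast-quad variable, 'left-up' encoded as 0 and 'left-low' encoded as 1 would be treated as having a meaningful relationship as integers. In this case, that assumption is untrue.

Instead, we must map these integer values onto new binary variables, one new variable for each categorical value.

For example, the breast-quad variable has the values: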

left-up
left-low
right-up
right-low
central

We can model this as 5 binary variables as follows:

left-up, left-low, right-up, right-low, central
1,0,0,0,0
0,1,0,0,0
0,0,1,0,0
0,0,0,1,0
0,0,0,0,1
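The table above can be reproduced by hand with a few lines of NumPy. This is only a sketch to show the mechanics: the category order follows the table, and the short list of values is made up for illustration.

```python
import numpy as np

# Category order taken from the table above.
categories = ["left-up", "left-low", "right-up", "right-low", "central"]
index_of = {name: i for i, name in enumerate(categories)}

# A few illustrative observations of the breast-quad variable.
values = ["left-low", "central", "left-up", "left-low"]
codes = np.array([index_of[v] for v in values])

# Row i of the identity matrix is the one hot vector for category i.
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Each row contains exactly one 1, in the column that belongs to the observed category, so no ordering between categories is implied.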

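This is called one hot encoding. We can one hot encode all of the categorical input variables using the OneHotEncoder class in scikit-learn.

We can one hot encode each feature after we have label encoded it. First we must transform the feature array into a 2-dimensional NumPy array where each integer value is a feature vector with a length of 1.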

feature = feature.reshape(X.shape[0], 1)

We can then create the OneHotEncoder and encode the feature array.

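# sparse_output replaces the older sparse argument (removed in scikit-learn 1.4);
# on scikit-learn versions before 1.2 use sparse=False instead
onehot_encoder = OneHotEncoder(sparse_output=False, categories='auto')
feature = onehot_encoder.fit_transform(feature)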
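Putting the three steps together on a single column gives the short sketch below. The column values are made up for illustration, and the dense array is obtained by calling toarray() on the default sparse result, which is one way to sidestep the argument name that differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One illustrative categorical column (menopause status).
column = np.array(["premeno", "ge40", "ge40", "lt40", "premeno"])

# Step 1: label encode; classes are sorted, so ge40=0, lt40=1, premeno=2.
feature = LabelEncoder().fit_transform(column)

# Step 2: reshape to a 2-D array with one column, as OneHotEncoder expects.
feature = feature.reshape(column.shape[0], 1)

# Step 3: one hot encode; the default output is sparse, toarray() makes it dense.
onehot = OneHotEncoder(categories="auto").fit_transform(feature).toarray()

print(onehot.shape)
print(onehot)
```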

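Finally, we can build up the input dataset by concatenating the one hot encoded features, one by one, adding them on as new columns (axis=1). We end up with an input vector made up of 43 binary input variables.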

# encode string input values as integers, then one hot encode them
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    # on scikit-learn versions before 1.2 use sparse=False instead of sparse_output=False
    onehot_encoder = OneHotEncoder(sparse_output=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
print("X shape: : ", encoded_x.shape)

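Ideally, we might experiment with not one hot encoding some of the input attributes, since we could encode them with an explicit ordinal relationship instead, for example the first column, age, with values like '40-49' and '50-59'. This is left as an exercise if you are interested in extending this example.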
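One possible starting point for that exercise is sketched below: the age ranges are ranked by their lower bound so that the integer codes preserve the natural order. The helper name lower_bound and the sample values are assumptions made for illustration, and the sketch assumes the range strings have already been stripped of any surrounding quote characters.

```python
import numpy as np

# Illustrative age values in the same 'low-high' format as the dataset.
ages = np.array(["40-49", "50-59", "30-39", "60-69", "40-49"])

def lower_bound(age_range):
    """Return the lower edge of a range such as '40-49' as an integer."""
    return int(age_range.split("-")[0])

# Rank the distinct ranges by their lower bound so the codes preserve order.
ordered = sorted(set(ages.tolist()), key=lower_bound)
rank_of = {age_range: rank for rank, age_range in enumerate(ordered)}

# A single ordinal column, shaped (n_samples, 1) so it can be concatenated
# with the one hot encoded columns along axis=1.
age_ordinal = np.array([rank_of[a] for a in ages]).reshape(-1, 1)

print(ordered)
print(age_ordinal.ravel().tolist())
```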

Below is the complete example with label and one hot encoded input variables and a label encoded output variable.

# binary classification, breast cancer dataset, label and one hot encoded
import numpy
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# load data
data = read_csv('breast-cancer.csv', header=None)
dataset = data.values
# split data into X and y
X = dataset[:,0:9]
X = X.astype(str)
Y = dataset[:,9]
# encode string input values as integers, then one hot encode them
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    # on scikit-learn versions before 1.2 use sparse=False instead of sparse_output=False
    onehot_encoder = OneHotEncoder(sparse_output=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
print("X shape: : ", encoded_x.shape)
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(encoded_x, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

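Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example, we get the following output.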

('X shape: : ', (285, 43))
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 71.58%

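Again we can see that the XGBoost framework chose the 'binary:logistic' objective automatically, the right objective for this binary classification problem.

Support for Missing Data

XGBoost can automatically learn how to best handle missing data.

In fact, XGBoost was designed to work with sparse data, like the one hot encoded data from the previous section, and missing data is handled the same way that sparse or zero values are handled, by minimizing the loss function.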

For more information on the technical details of how missing values are handled in XGBoost, see Section 3.4 "Sparsity-aware Split Finding" in the paper XGBoost: A Scalable Tree Boosting System.
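The core idea in that section is that each split learns a default direction for missing entries: the rows with a missing value are tried on the left branch and on the right branch, and the direction with the larger gain is kept. The sketch below illustrates this for a single feature and a single threshold. It is a simplified illustration, not the library's implementation: the function names are hypothetical, the constant factor and the gamma penalty of the gain are omitted, and the gradients are made-up numbers.

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0):
    """Gain of a split from gradient and hessian sums (constants omitted)."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    return (score(g_left, h_left) + score(g_right, h_right)
            - score(g_left + g_right, h_left + h_right))

def best_default_direction(x, grad, hess, threshold, reg_lambda=1.0):
    """Try sending the missing (NaN) rows left, then right; keep the better gain."""
    missing = np.isnan(x)
    goes_left = ~missing & (x < threshold)
    goes_right = ~missing & ~goes_left
    g_miss, h_miss = grad[missing].sum(), hess[missing].sum()
    g_l, h_l = grad[goes_left].sum(), hess[goes_left].sum()
    g_r, h_r = grad[goes_right].sum(), hess[goes_right].sum()
    gain_left = split_gain(g_l + g_miss, h_l + h_miss, g_r, h_r, reg_lambda)
    gain_right = split_gain(g_l, h_l, g_r + g_miss, h_r + h_miss, reg_lambda)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Made-up feature values and gradients; the missing rows have gradients
# that resemble the rows with small feature values.
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
grad = np.array([1.0, 1.0, 1.0, -1.0, -1.0, 1.0])
hess = np.ones_like(grad)

direction, gain = best_default_direction(x, grad, hess, threshold=5.0)
print(direction, round(gain, 4))
```

In this toy case the missing rows behave like the low-valued rows, so sending them left yields the larger gain and becomes the default direction for the split.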

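The Horse Colic dataset is a good example to demonstrate this capability, as it contains a large percentage of missing data, approximately 30%.

Download the dataset and place it in your current working directory with the file name "horse-colic.csv".

Horse Colic dataset
Horse Colic dataset description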

1. TItle: Horse Colic database
2. Source Information
   -- Creators: Mary McLeish & Matt Cecile
      Department of Computer Science
      University of Guelph
      Guelph, Ontario, Canada N1G 2W1
      mdmcleish@
   -- Donor:    Will Taylor (taylor@)
   -- Date:     8/6/89
3. Past Usage:
   -- Unknown
4. Relevant Information:
   -- 2 data files
      -- horse-colic.data: 300 training instances
      -- horse-colic.test: 68 test instances
   -- Possible class attributes: 24 (whether lesion is surgical)
      -- others include: 23, 25, 26, and 27
   -- Many Data types: (continuous, discrete, and nominal)
5. Number of Instances: 368 (300 for training, 68 for testing)
6. Number of attributes: 28
7. Attribute Information:
  1:  surgery?
          1 = Yes, it had surgery
          2 = It was treated without surgery
  2:  Age
          1 = Adult horse
          2 = Young (< 6 months)
  3:  Hospital Number
          - numeric id
          - the case number assigned to the horse
            (may not be unique if the horse is treated > 1 time)
  4:  rectal temperature
          - linear
          - in degrees celsius.
          - An elevated temp may occur due to infection.
          - temperature may be reduced when the animal is in late shock
          - normal temp is 37.8
          - this parameter will usually change as the problem progresses
               eg. may start out normal, then become elevated because of
                   the lesion, passing back through the normal range as the
                   horse goes into shock
  5:  pulse
          - linear
          - the heart rate in beats per minute
          - is a reflection of the heart condition: 30 -40 is normal for adults
          - rare to have a lower than normal rate although athletic horses
            may have a rate of 20-25
          - animals with painful lesions or suffering from circulatory shock
            may have an elevated heart rate
  6:  respiratory rate
          - linear
          - normal rate is 8 to 10
          - usefulness is doubtful due to the great fluctuations
  7:  temperature of extremities
          - a subjective indication of peripheral circulation
          - possible values:
               1 = Normal
               2 = Warm
               3 = Cool
               4 = Cold
          - cool to cold extremities indicate possible shock
          - hot extremities should correlate with an elevated rectal temp.
  8:  peripheral pulse
          - subjective
          - possible values are:
               1 = normal
               2 = increased
               3 = reduced
               4 = absent
          - normal or increased p.p. are indicative of adequate circulation
            while reduced or absent indicate poor perfusion
  9:  mucous membranes
          - a subjective measurement of colour
          - possible values are:
               1 = normal pink
               2 = bright pink
               3 = pale pink
               4 = pale cyanotic
               5 = bright red / injected
               6 = dark cyanotic
          - 1 and 2 probably indicate a normal or slightly increased
            circulation
          - 3 may occur in early shock
          - 4 and 6 are indicative of serious circulatory compromise
          - 5 is more indicative of a septicemia
 10: capillary refill time
          - a clinical judgement. The longer the refill, the poorer the
            circulation
          - possible values
               1 = < 3 seconds
               2 = >= 3 seconds
 11: pain - a subjective judgement of the horse's pain level
          - possible values:
               1 = alert, no pain
               2 = depressed
               3 = intermittent mild pain
               4 = intermittent severe pain
               5 = continuous severe pain
          - should NOT be treated as a ordered or discrete variable!
          - In general, the more painful, the more likely it is to require
            surgery
          - prior treatment of pain may mask the pain level to some extent
 12: peristalsis
          - an indication of the activity in the horse's gut. As the gut
            becomes more distended or the horse becomes more toxic, the
            activity decreases
          - possible values:
               1 = hypermotile
               2 = normal
               3 = hypomotile
               4 = absent
 13: abdominal distension
          - An IMPORTANT parameter.
          - possible values
               1 = none
               2 = slight
               3 = moderate
               4 = severe
          - an animal with abdominal distension is likely to be painful and
            have reduced gut motility.
          - a horse with severe abdominal distension is likely to require
            surgery just tio relieve the pressure
 14: nasogastric tube
          - this refers to any gas coming out of the tube
          - possible values:
               1 = none
               2 = slight
               3 = significant
          - a large gas cap in the stomach is likely to give the horse
            discomfort
 15: nasogastric reflux
          - possible values
               1 = none
               2 = > 1 liter
               3 = < 1 liter
          - the greater amount of reflux, the more likelihood that there is
            some serious obstruction to the fluid passage from the rest of
            the intestine
 16: nasogastric reflux PH
          - linear
          - scale is from 0 to 14 with 7 being neutral
          - normal values are in the 3 to 4 range
 17: rectal examination - feces
          - possible values
               1 = normal
               2 = increased
               3 = decreased
               4 = absent
          - absent feces probably indicates an obstruction
 18: abdomen
          - possible values
               1 = normal
               2 = other
               3 = firm feces in the large intestine
               4 = distended small intestine
               5 = distended large intestine
          - 3 is probably an obstruction caused by a mechanical impaction
            and is normally treated medically
          - 4 and 5 indicate a surgical lesion
 19: packed cell volume
          - linear
          - the # of red cells by volume in the blood
          - normal range is 30 to 50. The level rises as the circulation
            becomes compromised or as the animal becomes dehydrated.
 20: total protein
          - linear
          - normal values lie in the 6-7.5 (gms/dL) range
          - the higher the value the greater the dehydration
 21: abdominocentesis appearance
          - a needle is put in the horse's abdomen and fluid is obtained from
            the abdominal cavity
          - possible values:
               1 = clear
               2 = cloudy
               3 = serosanguinous
          - normal fluid is clear while cloudy or serosanguinous indicates
            a compromised gut
 22: abdomcentesis total protein
          - linear
          - the higher the level of protein the more likely it is to have a
            compromised gut. Values are in gms/dL
 23: outcome
          - what eventually happened to the horse?
          - possible values:
               1 = lived
               2 = died
               3 = was euthanized
 24: surgical lesion?
          - retrospectively, was the problem (lesion) surgical?
          - all cases are either operated upon or autopsied so that
            this value and the lesion type are always known
          - possible values:
               1 = Yes
               2 = No
 25, 26, 27: type of lesion
          - first number is site of lesion
               1 = gastric
               2 = sm intestine
               3 = lg colon
               4 = lg colon and cecum
               5 = cecum
               6 = transverse colon
               7 = retum/descending colon
               8 = uterus
               9 = bladder
               11 = all intestinal sites
               00 = none
          - second number is type
               1 = simple
               2 = strangulation
               3 = inflammation
               4 = other
          - third number is subtype
               1 = mechanical
               2 = paralytic
               0 = n/a
          - fourth number is specific code
               1 = obturation
               2 = intrinsic
               3 = extrinsic
               4 = adynamic
               5 = volvulus/torsion
               6 = intussuption
               7 = thromboembolic
               8 = hernia
               9 = lipoma/slenic incarceration
               10 = displacement
               0 = n/a
 28: cp_data
          - is pathology data present for this case?
               1 = Yes
               2 = No
          - this variable is of no significance since pathology data
            is not included or collected for these cases
8. Missing values: 30% of the values are missing

Below is a sample of the raw dataset.

2 1 530101 38.50 66 28 3 3 ? 2 5 4 4 ? ? ? 3 5 45.00 8.40 ? ? 2 2 11300 00000 00000 2
1 1 534817 39.2 88 20 ? ? 4 1 3 4 2 ? ? ? 4 2 50 85 2 2 3 2 02208 00000 00000 2
2 1 530334 38.30 40 24 1 1 3 1 3 3 1 ? ? ? 1 1 33.00 6.70 ? ? 1 2 00000 00000 00000 1
1 9 5290409 39.10 164 84 4 1 6 2 2 4 4 1 2 5.00 3 ? 48.00 7.20 3 5.30 2 1 02208 00000 00000 1
2 1 530255 37.30 104 35 ? ? 6 2 ? ? ? ? ? ? ? ? 74.00 7.40 ? ? 2 2 04300 00000 00000 2
...

The values are separated by whitespace, and we can easily load the file using the Pandas function read_csv.

# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = read_csv("horse-colic.csv", sep=r"\s+", header=None)
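As a quick check of the loading step, the sketch below parses the first three sample rows shown above from an in-memory string instead of the real file, then counts the '?' markers. The in-memory buffer is an assumption made so the snippet runs without horse-colic.csv.

```python
import io
import pandas as pd

# First three rows of the sample above, held in memory instead of a file.
sample = """\
2 1 530101 38.50 66 28 3 3 ? 2 5 4 4 ? ? ? 3 5 45.00 8.40 ? ? 2 2 11300 00000 00000 2
1 1 534817 39.2 88 20 ? ? 4 1 3 4 2 ? ? ? 4 2 50 85 2 2 3 2 02208 00000 00000 2
2 1 530334 38.30 40 24 1 1 3 1 3 3 1 ? ? ? 1 1 33.00 6.70 ? ? 1 2 00000 00000 00000 1
"""

# sep=r"\s+" treats any run of whitespace as a single delimiter.
dataframe = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Columns that contain '?' are loaded as strings, so the marker can be counted.
missing_count = int((dataframe == "?").to_numpy().sum())
missing_fraction = missing_count / dataframe.size

print(dataframe.shape)
print(missing_count, round(missing_fraction, 3))
```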

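Once loaded, we can see that the missing data is marked with a question mark character ('?'). We can change these missing values to the sparse value that this tutorial has XGBoost treat as missing, the value zero (0). Note that recent versions of the XGBoost scikit-learn wrapper use NaN as the missing-value marker by default, so zero-filled entries are read as ordinary zeros unless missing=0 is set on the model.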

# set missing values to 0
X[X == '?'] = 0

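Because the missing data was marked as strings, the columns with missing data were all loaded as string data types. We can now convert the entire set of input data to numerical values.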

# convert to numeric
X = X.astype('float32')
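A variation worth knowing is to mark the missing entries as NaN rather than 0 before the numeric conversion, since NaN is the default missing-value marker of XGBoost's scikit-learn wrapper in recent releases. The sketch below shows the mechanics on a tiny made-up array that stands in for the loaded X.

```python
import numpy as np

# Tiny object array standing in for the loaded X (values are illustrative).
X = np.array([["38.50", "?", "3"],
              ["39.2", "88", "?"]], dtype=object)

X_nan = X.copy()
# Mark the missing entries explicitly instead of filling them with 0.
X_nan[X_nan == "?"] = np.nan
# The remaining strings parse as numbers, so the array becomes purely numeric.
X_nan = X_nan.astype("float32")

print(X_nan)
print(int(np.isnan(X_nan).sum()))
```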

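Finally, this is a binary classification problem even though the class values are marked with the integers 1 and 2. We model binary classification problems in XGBoost as logistic 0 and 1 values. We can easily convert the Y dataset to 0 and 1 integers using the LabelEncoder, as we did in the iris flowers example.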

# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
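To see what this re-labelling does to the 1 and 2 class values, the sketch below reproduces the mapping with plain NumPy on a made-up label vector. It is an equivalent illustration for this simple case, not a replacement for the LabelEncoder used in the listing.

```python
import numpy as np

# Illustrative class column using the 1 / 2 coding of the dataset.
Y = np.array([2, 1, 1, 2, 1])

# np.unique returns the sorted distinct classes and, with return_inverse,
# the index of each original value within that sorted list.
classes, encoded = np.unique(Y, return_inverse=True)

print(classes.tolist())
print(encoded.tolist())
```

The smallest class value becomes 0 and the next becomes 1, which is the 0 and 1 coding expected for a binary logistic objective.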

The full code listing is provided below for completeness.

# binary classification, missing data
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
# load data (sep=r"\s+" splits on any run of whitespace)
dataframe = read_csv("horse-colic.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split data into X and y
X = dataset[:,0:27]
Y = dataset[:,27]
# set missing values to 0
X[X == '?'] = 0
# convert to numeric
X = X.astype('float32')
# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

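Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example produces the following output.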

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 83.84%

We can tease out the effect of XGBoost's automatic handling of missing values by marking the missing values with a non-zero value, such as 1.
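The change amounts to one different constant in the marking step. The sketch below shows it on a tiny made-up array standing in for the loaded X; in the full listing only the assignment line would change.

```python
import numpy as np

# Tiny object array standing in for the loaded X (values are illustrative).
X = np.array([["38.50", "?"],
              ["?", "88"]], dtype=object)

# Mark missing values with 1 instead of 0, then convert to numeric.
X[X == "?"] = 1
X = X.astype("float32")

print(X)
```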

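Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Re-running the example shows a drop in accuracy for the model.

We can also impute the missing data with a specific value.

It is common to use the mean or the median of the column. We can easily impute the missing data using the scikit-learn SimpleImputer class.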

# impute missing values as the mean
imputer = SimpleImputer()
imputed_x = imputer.fit_transform(X)
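To see what the imputer does, the sketch below applies both the mean and the median strategy to a small made-up matrix with NaN entries. The numbers are chosen so that the two strategies give different fills.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small illustrative matrix; NaN marks the missing entries.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0],
              [11.0, 9.0]])

# Column means are 5.0 and 7.0; column medians are 3.0 and 8.0.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)

print(mean_imputed)
print(median_imputed)
```

The default strategy of SimpleImputer is the mean, which is what the snippet above relies on. In a stricter evaluation the imputer would be fit on the training split only and then applied to the test split, so that test data does not influence the fill values.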

Below is the full example with missing data imputed with the mean value of each column.

# binary classification, missing data, impute with mean
import numpy
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
# load data (sep=r"\s+" splits on any run of whitespace)
dataframe = read_csv("horse-colic.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split data into X and y
X = dataset[:,0:27]
Y = dataset[:,27]
# mark missing values as NaN so the imputer can find them
X[X == '?'] = numpy.nan
# convert to numeric
X = X.astype('float32')
# impute missing values as the mean
imputer = SimpleImputer()
imputed_x = imputer.fit_transform(X)
# encode Y class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(imputed_x, label_encoded_y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))