Study notes, for reference only; any errors will be corrected.

PS: This blog is written in a mixed Chinese-English format.


Nonlinear Regression Models



K-Nearest Neighbors



The KNN approach simply predicts a new sample using the K closest samples from the training set.

KNN cannot be cleanly summarized by a model. Instead, its construction is based solely on the individual samples from the training data.

To predict a new sample for regression, KNN identifies that sample's K nearest neighbors in the predictor space. The predicted response for the new sample is then the mean of the K neighbors' responses. Other summary statistics, such as the median, can also be used in place of the mean to predict the new sample.
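As a minimal sketch of this procedure in plain NumPy (the helper name `knn_predict` is my own, not from the text):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict one new sample as the mean response of its k nearest
    training samples, using Euclidean distance."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]   # indices of the k closest samples
    return y_train[nearest].mean()    # swap in np.median for a median-based KNN
```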

The basic KNN method as described above depends on how the user defines the distance between samples. Euclidean distance is the most commonly used metric and is defined as:

$$\left( \sum_{j=1}^{P} (x_{aj} - x_{bj})^2 \right)^{1/2}$$

where $\mathbf{x}_a$ and $\mathbf{x}_b$ are two individual samples. Minkowski distance is a generalization of Euclidean distance and is defined as:

$$\left( \sum_{j=1}^{P} |x_{aj} - x_{bj}|^q \right)^{1/q}$$

where $q > 0$. It is easy to see that when $q = 2$, Minkowski distance is the same as Euclidean distance. When $q = 1$, Minkowski distance is equivalent to Manhattan distance, which is a common metric used for samples with binary predictors.
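To see these relationships concretely, here is a small check with SciPy (a library choice of mine; the two sample vectors are made up):

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, minkowski

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

# Minkowski with q = 2 reduces to Euclidean distance,
# and with q = 1 to Manhattan (city-block) distance.
assert np.isclose(minkowski(a, b, p=2), euclidean(a, b))
assert np.isclose(minkowski(a, b, p=1), cityblock(a, b))
```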

Because the KNN method fundamentally depends on the distance between samples, the scale of the predictors can have a dramatic influence on the distances among samples.

Data with predictors that are on vastly different scales will generate distances that are weighted towards the predictors that have the largest scales.

That is, predictors with the largest scales will contribute most to the distance between samples. To avoid this potential bias and to enable each predictor to contribute equally to the distance calculation, we recommend that all predictors be centered and scaled prior to performing KNN.
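For example, with scikit-learn (one possible toolkit; the original text is library-agnostic), the centering/scaling step can be tied to the model in a pipeline so that its parameters are estimated from the training data only:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler centers each predictor to mean 0 and scales it to unit
# variance, so every predictor contributes comparably to the distances.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
# model.fit(X_train, y_train); model.predict(X_new)
```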

In addition to the issue of scaling, using distances between samples can be problematic if one or more of a sample's predictor values is missing, since it is then not possible to compute the distance between samples.

There are a couple of approaches to handling this. First, either the samples or the predictors can be excluded from the analysis.

If a predictor contains a sufficient amount of information across the samples, then an alternative approach is to impute the missing data using a naive estimator, such as the mean of the predictor, or a nearest-neighbor approach that uses only the predictors with complete information.
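As a sketch of both imputation options using scikit-learn's imputers (my choice of tooling; the toy matrix is made up):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Naive estimator: replace each missing value with that predictor's mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Nearest-neighbor imputation: distances are computed using only the
# predictors with complete information for each pair of samples.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```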

Upon pre-processing the data and selecting the distance metric, the next step is to find the optimal number of neighbors. Like tuning parameters in other models, K can be determined by resampling.
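A sketch of tuning K by 10-fold cross-validated RMSE with scikit-learn (the grid of 1 to 20 neighbors is an arbitrary illustration):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsRegressor())])
grid = GridSearchCV(pipe,
                    param_grid={"knn__n_neighbors": list(range(1, 21))},
                    scoring="neg_root_mean_squared_error",
                    cv=10)
# grid.fit(X_train, y_train); grid.best_params_["knn__n_neighbors"]
```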

Note that a small K can lead to overfitting, while a large K leads to underfitting. In the figure below, the RMSE drops quickly as K increases, then levels off, and finally climbs slowly; this pattern in the resampling profile is typical for KNN models:

[Figure: resampling RMSE profile versus the number of neighbors K]



The elementary version of KNN is intuitive and straightforward and can produce decent predictions, especially when the response is dependent on the local predictor structure.

However, this version does have some notable problems, for which researchers have sought solutions. Two commonly noted problems are computational time and the disconnect between local structure and the predictive ability of KNN.

The computational time problem can be addressed with k-dimensional trees (k-d trees).

A k-d tree orthogonally partitions the predictor space using a tree approach. After the tree has been grown, a new sample is placed through the structure. Distances are then computed only for those training observations in the tree that are close to the new sample.
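A small sketch with SciPy's `cKDTree` (one readily available k-d tree implementation; the data here are random placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))

tree = cKDTree(X_train)             # grow the tree: partition the space once
x_new = rng.normal(size=(1, 5))
dist, idx = tree.query(x_new, k=5)  # search only the nearby partitions
```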

When the local structure of the predictors is unrelated to the response, KNN can have poor predictive performance. Irrelevant or noisy predictors are a major hazard, because they cause samples that are truly close to drift apart from one another in the predictor space.

Hence, removing irrelevant, noise-laden predictors is a key pre-processing step for KNN.
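One way to approximate this filtering step, as a sketch (a univariate correlation filter via scikit-learn, which is only one of many possible approaches; `k=10` is arbitrary):

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Keep only the 10 predictors most linearly associated with the response;
# like the scaler, the filter is fit on the training data only.
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_regression, k=10),
                      KNeighborsRegressor(n_neighbors=5))
```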

Another approach to enhancing KNN's predictivity is to weight each neighbor's contribution to the prediction of a new sample based on its distance to the new sample. In this variation, training samples that are closer to the new sample contribute more to the predicted response, while those that are farther away contribute less.
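In scikit-learn this variation is available as the `weights="distance"` option (a library detail, not something the text prescribes):

```python
from sklearn.neighbors import KNeighborsRegressor

# Each neighbor is weighted by the inverse of its distance to the new
# sample, so closer neighbors contribute more to the predicted response.
knn_weighted = KNeighborsRegressor(n_neighbors=5, weights="distance")
```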