PS: This blog post was written in a mix of Chinese and English.
The KNN approach simply predicts a new sample using the K closest samples from the training set.
KNN cannot be cleanly summarized by a model. Instead, its construction is solely based on the individual samples from the training data.
To predict a new sample for regression, KNN identifies that sample’s K nearest neighbors in the predictor space. The predicted response for the new sample is then the mean of the K neighbors’ responses. Other summary statistics, such as the median, can also be used in place of the mean to predict the new sample.
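The prediction rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from statistics import mean, median

def knn_predict(train_X, train_y, new_x, k=3, summary=mean):
    """Predict the response for new_x from its k nearest training samples."""
    # Distance from new_x to every training sample (Euclidean, by assumption)
    dists = [math.dist(row, new_x) for row in train_X]
    # Indices of the k training samples with the smallest distances
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # Summarize the neighbors' responses (mean by default, median optional)
    return summary([train_y[i] for i in nearest])

# Hypothetical toy data: two predictors, one numeric response
train_X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]]
train_y = [1.0, 2.0, 6.0, 50.0]

pred_mean = knn_predict(train_X, train_y, [2.0, 2.0], k=3)
pred_median = knn_predict(train_X, train_y, [2.0, 2.0], k=3, summary=median)
```

Passing `summary=median` swaps the summary statistic without changing the neighbor search, which mirrors the point that the mean is only one possible choice.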
The basic KNN method as described above depends on how the user defines the distance between samples. Euclidean distance is the most commonly used metric and is defined as follows:
$$\left( \sum_{j=1}^{P} (x_{aj} - x_{bj})^2 \right)^{1/2}$$
where $x_a$ and $x_b$ are two individual samples and $P$ is the number of predictors. Minkowski distance is a generalization of Euclidean distance and is defined as:
$$\left( \sum_{j=1}^{P} |x_{aj} - x_{bj}|^q \right)^{1/q}$$
where $q > 0$. It is easy to see that when $q = 2$, Minkowski distance is the same as Euclidean distance. When $q = 1$, Minkowski distance is equivalent to Manhattan distance, which is a common metric used for samples with binary predictors.
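As a small sketch of how these metrics relate, the function below computes the Minkowski distance, the q-th root of the summed q-th powers of the absolute coordinate differences, and recovers the Euclidean and Manhattan distances as special cases. The function name `minkowski` and the sample points are assumptions for illustration.

```python
def minkowski(a, b, q=2.0):
    """Minkowski distance between two samples a and b, for q > 0."""
    if q <= 0:
        raise ValueError("q must be positive")
    # Sum |a_j - b_j|^q over all predictors, then take the q-th root
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

a, b = [0.0, 0.0], [3.0, 4.0]
euclidean = minkowski(a, b, q=2)  # q = 2 gives Euclidean distance
manhattan = minkowski(a, b, q=1)  # q = 1 gives Manhattan distance
```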
Because the KNN method fundamentally depends on distance between samples, the scale of the predictors can have a dramatic influence on the distances among samples.
Data with predictors that are on vastly different scales will generate distances that are weighted towards predictors that have the largest scales.
That is, predictors with the largest scales will contribute most to the distance between samples. To avoid this potential bias and to enable each predictor to contribute equally to the distance calculation, we recommend that all predictors be centered and scaled prior to performing KNN.
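The effect of scale can be seen in a short sketch. The data here are hypothetical (an income-like predictor in the tens of thousands and an age-like predictor in the tens), and `center_and_scale` is an illustrative helper that assumes no predictor is constant, since a constant column would have a standard deviation of zero.

```python
import math
from statistics import mean, stdev

def center_and_scale(X):
    """Shift each predictor to mean 0 and rescale it to standard deviation 1."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]  # assumes no column is constant
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in X]

# Hypothetical samples: [income, age]
X = [[30000.0, 25.0], [30500.0, 60.0], [60000.0, 26.0]]

# On the raw scale, the income column dominates the distance
raw_age_gap = math.dist(X[0], X[1])     # samples differ mostly in age
raw_income_gap = math.dist(X[0], X[2])  # samples differ mostly in income

# After centering and scaling, both predictors contribute comparably
Z = center_and_scale(X)
scaled_age_gap = math.dist(Z[0], Z[1])
scaled_income_gap = math.dist(Z[0], Z[2])
```

On the raw data, the large age difference between the first two samples barely registers next to the income difference; after centering and scaling, the two gaps are of similar size.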
In addition to the issue of scaling, using distances between samples can be problematic if one or more of the predictor values for a sample are missing, since it is then not possible to compute the distance between samples. There are a couple of ways to handle this.
First, either the affected samples or the affected predictors can be excluded from the analysis.
If a predictor contains a sufficient amount of information across the samples, then an alternative approach is to impute the missing data using a naive estimator such as the mean of the predictor, or a nearest neighbor approach that uses only the predictors with complete information.
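A minimal sketch of the mean-imputation option follows. Missing values are represented as `None`, which is an assumption of this example, as are the helper name `impute_column_means` and the toy data. A column with no observed values at all would raise an error here and would have to be excluded instead.

```python
from statistics import mean

def impute_column_means(X):
    """Replace None entries with the mean of the observed values in that column."""
    cols = list(zip(*X))
    # Mean of each predictor, computed from non-missing entries only
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, col_means)] for row in X]

# Hypothetical data with one missing value in each predictor
X = [[1.0, 10.0], [None, 20.0], [3.0, None]]
filled = impute_column_means(X)
```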
Upon pre-processing the data and selecting the distance metric, the next step is to find the optimal number of neighbors. Like tuning parameters from other models, K can be determined by resampling.
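One way to tune K by resampling is sketched below, using leave-one-out cross-validation and RMSE as the performance measure. Both choices are assumptions made to keep the example short; other resampling schemes such as k-fold cross-validation or the bootstrap follow the same pattern. The helper names and the toy data are hypothetical.

```python
import math
from statistics import mean

def knn_predict(train_X, train_y, new_x, k):
    """Mean response of the k training samples closest to new_x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], new_x))
    return mean(train_y[i] for i in order[:k])

def loocv_rmse(X, y, k):
    """Leave-one-out RMSE of KNN regression for a given k."""
    sq_errors = []
    for i in range(len(X)):
        # Hold out sample i and predict it from all remaining samples
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        pred = knn_predict(rest_X, rest_y, X[i], k)
        sq_errors.append((pred - y[i]) ** 2)
    return math.sqrt(mean(sq_errors))

def tune_k(X, y, candidates):
    """Return the candidate k with the lowest resampled RMSE, plus all scores."""
    scores = {k: loocv_rmse(X, y, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Hypothetical toy data: one predictor, response is exactly 2 * x
X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
best_k, scores = tune_k(X, y, candidates=[1, 2, 3, 5])
```

Each candidate K is scored only on samples that were held out of the neighbor search, so the comparison reflects predictive performance rather than fit to the training set.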
The elementary version of KNN is intuitive and straightforward and can produce decent predictions, especially when the response is dependent on the local predictor structure.
However, this version does have some notable problems, for which researchers have sought solutions. Two commonly noted problems are computational time and the disconnect between local structure and the predictive ability of KNN.
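Computational time grows with the size of the training set, because predicting a new sample requires its distance to every training sample. One remedy is to store the training data in a k-d tree. A k-d tree orthogonally partitions the predictor space using a tree approach. After the tree has been grown, a new sample is placed through the structure. Distances are only computed for those training observations in the tree that are close to the new sample.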
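The idea can be illustrated with a small k-d tree that finds the single nearest training point. This is a simplified sketch under stated assumptions: it splits at the median of one coordinate per level, stores nodes as plain dictionaries, and returns only one neighbor. The names `build_kdtree` and `nearest` and the sample points are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on one coordinate axis per level (median split)."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the predictors
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    d = math.dist(node["point"], target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    # Signed distance from the target to this node's splitting plane
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Search the far side only if a closer point could lie beyond the plane
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

pts = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(pts)
dist, point = nearest(tree, (9.0, 2.0))
```

The pruning test on `abs(diff)` is what saves work: whole branches are skipped when the splitting plane is already farther away than the best neighbor found so far.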
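The second problem arises when the local predictor structure is not relevant to the response. Irrelevant or noisy predictors distort the distances, so that samples with similar responses no longer sit close together in the predictor space. Hence, removing irrelevant, noise-laden predictors is a key pre-processing step for KNN.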
Another approach to enhancing the predictive performance of KNN is to weight the neighbors’ contribution to the prediction of a new sample based on their distance to the new sample. In this variation, training samples that are closer to the new sample contribute more to the predicted response, while those that are farther away contribute less to the predicted response.
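A sketch of distance-weighted prediction is shown below. The text does not fix a weighting scheme, so inverse-distance weights are used here as one common choice and should be read as an assumption, along with the function name, the small `eps` guard against division by zero, and the toy data.

```python
import math

def weighted_knn_predict(train_X, train_y, new_x, k=3, eps=1e-12):
    """Weighted mean of the k nearest responses, with weights 1 / distance."""
    dists = [math.dist(row, new_x) for row in train_X]
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # Closer neighbors get larger weights; eps avoids dividing by zero
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    weighted_sum = sum(w * train_y[i] for w, i in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Hypothetical toy data: one predictor, one numeric response
train_X = [[0.0], [1.0], [4.0]]
train_y = [0.0, 10.0, 40.0]

weighted_pred = weighted_knn_predict(train_X, train_y, [1.5], k=3)
unweighted_pred = sum(train_y) / len(train_y)  # plain mean of the same 3 neighbors
```

With the new sample at 1.5, the nearest training point (at 1.0, response 10) pulls the weighted prediction toward its own response, whereas the plain mean treats all three neighbors equally.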