Data leakage
A common symptom: the model scores well on both training and validation data, yet its accuracy drops sharply when predicting on new real-world data.
This can happen because the predictors contain information about the target, i.e., information is shared between the training and test data sets.
The whole point of splitting training from validation data is to make the validation set simulate unseen real-world data.
- Leaky predictors
    - data not available at prediction time (e.g., in the pneumonia example, when predicting whether a new patient has pneumonia, we cannot know whether they took antibiotics, because antibiotics are taken after the pneumonia occurs)
    - the model then overfits to this leaked information
- Leaky validation strategy
    - e.g., when the validation data influences preprocessing
The first type is "reversed causality": the predictor is generated after (as a consequence of) the target.
The second type is failing to keep the training and validation sets truly separate; this happens if you run preprocessing (like fitting the Imputer for missing values) before calling train_test_split.
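A minimal sketch of this second pitfall, using a toy NumPy array and scikit-learn's SimpleImputer (the current name for the Imputer mentioned above); the data here is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [100.0]])  # toy data with a missing value
y = np.array([0, 0, 1, 1])

# LEAKY: the imputer's mean is computed from ALL rows, so statistics from the
# future validation rows leak into the training data.
X_leaky = SimpleImputer(strategy="mean").fit_transform(X)
X_train_bad, X_val_bad, _, _ = train_test_split(X_leaky, y, random_state=0)

# SAFE: split first, then fit the imputer on the training rows only, and reuse
# those training statistics to transform the validation rows.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_val = imputer.transform(X_val)  # transform only; no refitting on validation data
```

The safe version treats the validation rows exactly like future unseen data: they contribute nothing to any fitted statistic.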
How to detect data leakage?
- Check the statistical correlation between each predictor and the target; suspiciously high correlation can be a sign of leakage.
- Be skeptical when model accuracy is surprisingly high.
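A quick screen for the first check is to rank predictors by absolute correlation with the target. A sketch with pandas, using synthetic data and hypothetical column names (took_antibiotic is deliberately a copy of the target to simulate leakage):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "age": rng.normal(50, 10, size=200),  # unrelated predictor: low correlation
    "took_antibiotic": target,            # leaky predictor: copies the target
    "got_pneumonia": target,
})

# Rank predictors by |correlation| with the target; values near 1.0 deserve
# a closer look for reversed causality.
corr = (df.corr()["got_pneumonia"]
          .drop("got_pneumonia")
          .abs()
          .sort_values(ascending=False))
print(corr)
```

High correlation alone does not prove leakage (a genuinely strong predictor also correlates with the target), so treat this as a prompt to ask whether the predictor could be known at prediction time.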
怎么防止data leakage
- If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps. This is easier if you use scikit-learn Pipelines. When using cross-validation, it's even more critical that you use pipelines and do your preprocessing inside the pipeline.
- Drop any predictors that are created or updated after the target value is determined.
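The first prevention rule can be sketched with a scikit-learn Pipeline: because the imputer lives inside the pipeline, each cross-validation fold fits it on that fold's training portion only. The data here is synthetic:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random((100, 3)) < 0.1] = np.nan      # inject some missing values
y = (rng.random(100) > 0.5).astype(int)

# Preprocessing happens inside the pipeline, so cross_val_score refits the
# imputer per fold and no validation statistics leak into training.
pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Fitting the imputer once on the full data and then cross-validating would leak each held-out fold's statistics into training; wrapping every fitted step in the pipeline makes that mistake impossible.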