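Traditional cross-validation
In machine learning, cross-validation is an important way to check how stable a model is. Most cross-validation schemes split the data into only a training set and a test set, rotating the held-out part once per round until every sample has been covered. In sklearn this can be run directly with cross_val_score.
One drawback of this approach is that it only splits the data into a training set and a test set, so when a validation set is needed for tuning parameters, none is available.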
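The plain k-fold procedure described above can be sketched as follows. This is a minimal illustration, not the original post's code: the iris dataset and the LogisticRegression model are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed example data and model; any estimator and dataset would do.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds; each fold is held out once
# while the model is trained on the remaining 4.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```

Each entry in scores is the score on one held-out fold, so there is no separate validation set in this scheme.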
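Split off the test set first, without rotating it
One way to get a training set, a test set, and a validation set is to split the data before training: set aside part of the data as the test set, then apply the k-fold cross-validation described above to the rest. In sklearn this can be done by splitting with train_test_split() and then running cross_val_score().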
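A minimal sketch of this two-step approach is shown below. The dataset, the model, the 20% test fraction, and the variable names (X_rest, val_scores, and so on) are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Assumed example data and model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Step 1: hold out a fixed test set (20% here, an arbitrary choice).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: k-fold cross-validation on the remaining data; the held-out
# fold in each round plays the role of the validation set.
val_scores = cross_val_score(model, X_rest, y_rest, cv=5)

# Final check on the test set, which never took part in the rotation.
model.fit(X_rest, y_rest)
test_score = model.score(X_test, y_test)
print(val_scores.mean(), test_score)
```

Note that the test set here is chosen once and stays fixed, which is exactly the limitation the next section addresses.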
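How to rotate the training, test, and validation sets
The drawback of that approach is that the test set never rotates. So how can the training, validation, and test sets all rotate through the whole dataset? The figure below shows the idea.
We can do this by hand with NumPy array index slicing. The code is as follows: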
import numpy as np
import pandas as pd
# Build a sample array
>>> data=np.array(np.arange(0,20*3).reshape(20,3))/3
>>> data=np.around(data,decimals=1)
>>> data
array([[ 0. , 0.3, 0.7],
[ 1. , 1.3, 1.7],
[ 2. , 2.3, 2.7],
[ 3. , 3.3, 3.7],
[ 4. , 4.3, 4.7],
[ 5. , 5.3, 5.7],
[ 6. , 6.3, 6.7],
[ 7. , 7.3, 7.7],
[ 8. , 8.3, 8.7],
[ 9. , 9.3, 9.7],
[10. , 10.3, 10.7],
[11. , 11.3, 11.7],
[12. , 12.3, 12.7],
[13. , 13.3, 13.7],
[14. , 14.3, 14.7],
[15. , 15.3, 15.7],
[16. , 16.3, 16.7],
[17. , 17.3, 17.7],
[18. , 18.3, 18.7],
[19. , 19.3, 19.7]])
# Build the labels
>>> y=np.arange(20)
>>> y
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
# Get the number of samples (rows)
>>> idx=data.shape[0]
>>> idx
20
Now loop over the data; here it is set up as 5-fold cross-validation.
for i in range(5): # for 10 folds change 5 to 10, likewise below
    print(i,"times: ")
    test=data[int(idx*i*0.2):int(idx*(i+1)*0.2)] # test set; for 10 folds change 0.2 to 0.1, likewise below
    test_y=y[int(idx*i*0.2):int(idx*(i+1)*0.2)] # test set labels
    if i+1 <= max(range(5)): # for 10 folds change 5 to 10, likewise below
        val=data[int(idx*(i+1)*0.2):int((i+2)*idx*0.2)] # validation set: the fold right after the test fold
        val_y=y[int(idx*(i+1)*0.2):int((i+2)*idx*0.2)] # validation set labels
        train=np.delete(data,range(int(idx*i*0.2),int(idx*(i+2)*0.2)),axis=0) # training set: what remains after removing the test and validation sets
        train_y=np.delete(y,range(int(idx*i*0.2),int(idx*(i+2)*0.2)),axis=0) # training set labels
    else: # last round: the last fold is the test set, the first fold is the validation set, the middle is the training set
        val=data[:int(((i+1)%4)*idx*0.2)] # (i+1)%4 is 1 here, so this is the first fold; for 10 folds change 4 to 9 and 0.2 to 0.1, likewise below
        val_y=y[:int(((i+1)%4)*idx*0.2)]
        train=np.delete(data,range(int(idx*i*0.2),int(idx*(i+1)*0.2)),axis=0) # first drop the test fold
        train=np.delete(train,range(int(((i+1)%4)*idx*0.2)),axis=0) # then drop the validation fold
        train_y=np.delete(y,range(int(idx*i*0.2),int(idx*(i+1)*0.2)),axis=0)
        train_y=np.delete(train_y,range(int(((i+1)%4)*idx*0.2)),axis=0)
    print("test:\n",test,"-----test_y",test_y)
    print("val:\n", val,"-----val_y",val_y)
    print("train\n", train,"---------train_y", train_y)
    print("---------------------------------------")
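The output is as follows. (The test and validation sets print as tf.Tensor here, which suggests that the run shown had converted them to TensorFlow tensors; the NumPy code above prints plain arrays with the same values.)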
0 times:
test:
tf.Tensor(
[[0. 0.3 0.7]
[1. 1.3 1.7]
[2. 2.3 2.7]
[3. 3.3 3.7]], shape=(4, 3), dtype=float32) -----test_y tf.Tensor([0 1 2 3], shape=(4,), dtype=int32)
val:
tf.Tensor(
[[4. 4.3 4.7]
[5. 5.3 5.7]
[6. 6.3 6.7]
[7. 7.3 7.7]], shape=(4, 3), dtype=float32) -----val_y tf.Tensor([4 5 6 7], shape=(4,), dtype=int32)
train
[[ 8. 8.3 8.7]
[ 9. 9.3 9.7]
[10. 10.3 10.7]
[11. 11.3 11.7]
[12. 12.3 12.7]
[13. 13.3 13.7]
[14. 14.3 14.7]
[15. 15.3 15.7]
[16. 16.3 16.7]
[17. 17.3 17.7]
[18. 18.3 18.7]
[19. 19.3 19.7]] ---------train_y [ 8 9 10 11 12 13 14 15 16 17 18 19]
---------------------------------------
1 times:
test:
tf.Tensor(
[[4. 4.3 4.7]
[5. 5.3 5.7]
[6. 6.3 6.7]
[7. 7.3 7.7]], shape=(4, 3), dtype=float32) -----test_y tf.Tensor([4 5 6 7], shape=(4,), dtype=int32)
val:
tf.Tensor(
[[ 8. 8.3 8.7]
[ 9. 9.3 9.7]
[10. 10.3 10.7]
[11. 11.3 11.7]], shape=(4, 3), dtype=float32) -----val_y tf.Tensor([ 8 9 10 11], shape=(4,), dtype=int32)
train
[[ 0. 0.3 0.7]
[ 1. 1.3 1.7]
[ 2. 2.3 2.7]
[ 3. 3.3 3.7]
[12. 12.3 12.7]
[13. 13.3 13.7]
[14. 14.3 14.7]
[15. 15.3 15.7]
[16. 16.3 16.7]
[17. 17.3 17.7]
[18. 18.3 18.7]
[19. 19.3 19.7]] ---------train_y [ 0 1 2 3 12 13 14 15 16 17 18 19]
---------------------------------------
2 times:
test:
tf.Tensor(
[[ 8. 8.3 8.7]
[ 9. 9.3 9.7]
[10. 10.3 10.7]
[11. 11.3 11.7]], shape=(4, 3), dtype=float32) -----test_y tf.Tensor([ 8 9 10 11], shape=(4,), dtype=int32)
val:
tf.Tensor(
[[12. 12.3 12.7]
[13. 13.3 13.7]
[14. 14.3 14.7]
[15. 15.3 15.7]], shape=(4, 3), dtype=float32) -----val_y tf.Tensor([12 13 14 15], shape=(4,), dtype=int32)
train
[[ 0. 0.3 0.7]
[ 1. 1.3 1.7]
[ 2. 2.3 2.7]
[ 3. 3.3 3.7]
[ 4. 4.3 4.7]
[ 5. 5.3 5.7]
[ 6. 6.3 6.7]
[ 7. 7.3 7.7]
[16. 16.3 16.7]
[17. 17.3 17.7]
[18. 18.3 18.7]
[19. 19.3 19.7]] ---------train_y [ 0 1 2 3 4 5 6 7 16 17 18 19]
---------------------------------------
3 times:
test:
tf.Tensor(
[[12. 12.3 12.7]
[13. 13.3 13.7]
[14. 14.3 14.7]
[15. 15.3 15.7]], shape=(4, 3), dtype=float32) -----test_y tf.Tensor([12 13 14 15], shape=(4,), dtype=int32)
val:
tf.Tensor(
[[16. 16.3 16.7]
[17. 17.3 17.7]
[18. 18.3 18.7]
[19. 19.3 19.7]], shape=(4, 3), dtype=float32) -----val_y tf.Tensor([16 17 18 19], shape=(4,), dtype=int32)
train
[[ 0. 0.3 0.7]
[ 1. 1.3 1.7]
[ 2. 2.3 2.7]
[ 3. 3.3 3.7]
[ 4. 4.3 4.7]
[ 5. 5.3 5.7]
[ 6. 6.3 6.7]
[ 7. 7.3 7.7]
[ 8. 8.3 8.7]
[ 9. 9.3 9.7]
[10. 10.3 10.7]
[11. 11.3 11.7]] ---------train_y [ 0 1 2 3 4 5 6 7 8 9 10 11]
---------------------------------------
4 times:
test:
tf.Tensor(
[[16. 16.3 16.7]
[17. 17.3 17.7]
[18. 18.3 18.7]
[19. 19.3 19.7]], shape=(4, 3), dtype=float32) -----test_y tf.Tensor([16 17 18 19], shape=(4,), dtype=int32)
val:
tf.Tensor(
[[0. 0.3 0.7]
[1. 1.3 1.7]
[2. 2.3 2.7]
[3. 3.3 3.7]], shape=(4, 3), dtype=float32) -----val_y tf.Tensor([0 1 2 3], shape=(4,), dtype=int32)
train
[[ 4. 4.3 4.7]
[ 5. 5.3 5.7]
[ 6. 6.3 6.7]
[ 7. 7.3 7.7]
[ 8. 8.3 8.7]
[ 9. 9.3 9.7]
[10. 10.3 10.7]
[11. 11.3 11.7]
[12. 12.3 12.7]
[13. 13.3 13.7]
[14. 14.3 14.7]
[15. 15.3 15.7]] ---------train_y [ 4 5 6 7 8 9 10 11 12 13 14 15]
---------------------------------------
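As the output shows, the data rotates through the groups, and the labels rotate along with it.
This example uses 5 folds. For 10 folds, change range(5) in the code to range(10), 0.2 to 0.1, and 4 to 9. The modulus has to be one less than the number of folds so that it evaluates to 1 in the last round and the validation set is exactly the first fold; with 8, the last round would take two folds as the validation set.
This code is not perfect; it only manages to rotate the three sets. If you have better code, please share it with me.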
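One possible way to avoid editing several constants when the fold count changes is to work with index arrays instead of float arithmetic. The sketch below is an assumption-laden alternative, not the original code: the function name rotating_splits and its parameters are hypothetical, and it relies on np.array_split to cut the sample indices into k nearly equal folds.

```python
import numpy as np

def rotating_splits(n_samples, k=5):
    """Yield (train_idx, val_idx, test_idx) for each of k rounds.

    Hypothetical helper: fold i is the test set, the following fold
    (wrapping around to fold 0) is the validation set, and all other
    folds form the training set.
    """
    if k < 3:
        raise ValueError("k must be at least 3 to have train, val and test folds")
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        j = (i + 1) % k  # validation fold, wraps to 0 in the last round
        train_idx = np.concatenate(
            [folds[m] for m in range(k) if m != i and m != j]
        )
        yield train_idx, folds[j], folds[i]

# Same sample data and labels as in the post.
data = np.around(np.arange(20 * 3).reshape(20, 3) / 3, decimals=1)
y = np.arange(20)

splits = list(rotating_splits(len(data), k=5))
for i, (train_idx, val_idx, test_idx) in enumerate(splits):
    print(i, "test:", y[test_idx], "val:", y[val_idx], "train:", y[train_idx])
```

Because the function returns indices, the same split can be applied to both data and y (for example data[train_idx] and y[train_idx]), and switching to 10 folds only requires passing k=10.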