Actual data from KFold split indices

Question

Suppose I have the following data:

import numpy as np
import pandas as pd

y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': np.random.randint(10, 20, size=10),
                  'b': np.random.randint(80, 90, size=10)})
X
    a   b
0   11  82
1   19  82
2   15  80
3   15  86
4   14  82
5   18  87
6   13  83
7   12  83
8   10  82
9   18  87

Splitting it into 5 folds gives the following indices:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
data = list(kf.split(X, y))
data
[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

But I want to prepare data further so that it contains the actual values, in the following format:

data =
   [(train1,trainlabel1,test1,testlabel1),
    (train2,trainlabel2,test2,testlabel2),
     ..,
    (train5,trainlabel5,test5,testlabel5)]

Expected output (from the given MWE):

[array([
        (array([[15,80],[15,86],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]), array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold1 train/label
        (array([[11,82],[19,82]]), array([[1],[1]])),  #fold1 test/label

        (array([[11,82],[19,82],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]),array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold2 train/label
        (array([[15,80],[15,86]]),array([[1],[1]])) #fold2 test/label

        ....
])]

Answer 1

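As you know, KFold().split(data) returns the indices to select. To select pandas.DataFrame rows with a list of indices, the simplest way is the loc method. This works here because X has the default 0..n-1 index, so labels and positions coincide; with any other index, iloc is the positional equivalent.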

for train_idx, test_idx in KFold(n_splits=2).split(X):
   x_train = X.loc[train_idx]
   x_test = X.loc[test_idx]

   y_train = y.loc[train_idx]
   y_test = y.loc[test_idx]

Then you can append the subset DataFrames to a list.
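As a minimal sketch of that last step, the loop below collects one tuple per fold. It makes a few assumptions that are not part of the answer above: the values of X are fixed to those printed in the question so the result is reproducible, y is left as the NumPy array from the question (so it is indexed directly rather than with loc), iloc is used for X because split() yields positional indices, and the list name folds is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Same data as printed in the question, fixed instead of random.
X = pd.DataFrame({'a': [11, 19, 15, 15, 14, 18, 13, 12, 10, 18],
                  'b': [82, 82, 80, 86, 82, 87, 83, 83, 82, 87]})
y = np.ones(10)
y[-5:] = 0

folds = []  # illustrative name: one (x_train, y_train, x_test, y_test) per fold
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # split() yields positional indices: iloc for the DataFrame,
    # plain integer-array indexing for the NumPy labels.
    folds.append((X.iloc[train_idx], y[train_idx],
                  X.iloc[test_idx], y[test_idx]))

x_train, y_train, x_test, y_test = folds[0]
print(x_test)
print(y_test)
```

Each element of folds still holds pandas objects for the features, so column names and the original row index are preserved.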

Answer 2

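Actually, @hotuagia's answer is correct. You are getting this error because you are trying to access the elements of y, which is a NumPy array, using loc, which is a DataFrame attribute. A convenient fix is to convert y to a pandas DataFrame or Series before passing it to KFold.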

So:

y = np.ones(10) 
y[-5:] = 0
X = pd.DataFrame({'a':np.random.randint(10,20, size=(10)),
                  'b':np.random.randint(80,90, size=(10))})
# Convert the y array to a pandas DataFrame or Series
y = pd.DataFrame(y)  # or pd.Series(y)

Then continue with @hotuagia's answer:

for train_idx, test_idx in KFold(n_splits=2).split(X):
   x_train = X.loc[train_idx]
   x_test = X.loc[test_idx]

   y_train = y.loc[train_idx]
   y_test = y.loc[test_idx]
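To get from this loop to the format asked for in the question, the per-fold subsets can be converted to NumPy arrays and gathered into a list. The following is a sketch under stated assumptions rather than a definitive implementation: materialize_folds is a hypothetical helper name, X is fixed to the values printed in the question, y is converted to a Series, iloc is used in place of loc so the code does not depend on the index labels, and the labels are reshaped to column vectors only to mirror the expected output.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Same data as printed in the question, fixed instead of random.
X = pd.DataFrame({'a': [11, 19, 15, 15, 14, 18, 13, 12, 10, 18],
                  'b': [82, 82, 80, 86, 82, 87, 83, 83, 82, 87]})
y = np.ones(10)
y[-5:] = 0
y = pd.Series(y)  # convert the array so it supports pandas indexing


def materialize_folds(X, y, n_splits=5):
    """Hypothetical helper: return [(train, trainlabel, test, testlabel), ...]."""
    data = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        data.append((
            X.iloc[train_idx].to_numpy(),                 # train features
            y.iloc[train_idx].to_numpy().reshape(-1, 1),  # train labels as a column
            X.iloc[test_idx].to_numpy(),                  # test features
            y.iloc[test_idx].to_numpy().reshape(-1, 1),   # test labels as a column
        ))
    return data


data = materialize_folds(X, y)
train1, trainlabel1, test1, testlabel1 = data[0]
print(test1)
print(testlabel1)
```

Because KFold does not shuffle by default, the first fold's test set is rows 0 and 1, which matches the indices shown in the question.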

python

machine-learning

scikit-learn

cross-validation

k-fold