sklearn loocv.split returning a smaller test and train array than expected

Question

Because I have a small dataset, I am using LOOCV (leave-one-out cross-validation) in sklearn.

When I run my classifier, I get the following error:

"Number of labels=41 does not match number of samples=42".

I generated the test and training sets with the following code:

otu_trans = test_train.transpose()
# transpose otu table 
merged = pd.concat([otu_trans, metadata[status]], axis=1, join='inner')
# merge phenotype column from metadata file with transposed otu table

X = merged.drop([status],axis=1)

# drop status from X 
y = merged[status]


encoder = LabelEncoder()
y = pd.Series(encoder.fit_transform(y),
index=y.index, name=y.name)
# convert T and TF labels to 0 and 1 respectively

loocv = LeaveOneOut()
loocv.get_n_splits(X)

for train_index, test_index in loocv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

input data

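When I check the shape of X_train and X_test, it is 42,41 rather than the 41,257 I think it should be, so it looks like the data is being split along the wrong axis.

Can anyone explain to me why this is happening?

Thanks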

Answer 1

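First of all, the initial matrix X is not affected at all. It is only used to generate the indices and split the data.

The shape of the initial X will always stay the same.

Now, here is a simple example of splitting with LOOCV: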
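To make that point concrete, here is a minimal sketch on a hypothetical 4x3 toy array (not the data from the question). It suggests that `split` only yields integer index arrays and leaves X itself untouched:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical toy data: 4 samples (rows), 3 features (columns).
X = np.arange(12).reshape(4, 3)
X_before = X.copy()

# split() yields pairs of integer index arrays, never slices of X.
splits = [(train.tolist(), test.tolist()) for train, test in LeaveOneOut().split(X)]
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)

print(X.shape)                       # (4, 3)
print(np.array_equal(X, X_before))   # True
```

Each fold leaves out exactly one row index, and X keeps its original shape and values throughout.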

import numpy as np
from sklearn.model_selection import LeaveOneOut

# I produce fake data with same dimensions as yours.
#fake data
X = np.random.rand(41,257)
#fake labels
y = np.random.rand(41)

#Now check that the shapes are correct:
X.shape
y.shape

This gives you:

(41, 257)
(41,)

Now split:

loocv = LeaveOneOut()
loocv.get_n_splits(X)

for train_index, test_index in loocv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    #classifier.fit(X_train, y_train)
    #classifier.predict(X_test)


X_train.shape
X_test.shape

This prints:

(40, 257)
(1, 257)

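As you can see, X_train contains 40 samples and X_test contains only 1 sample. This is correct, because we are using LOOCV splitting.

The initial X matrix has 41 samples in total, so we use 40 for training and 1 for testing.

This loop produces many X_train and X_test matrices. Specifically, it produces N train/test pairs, where N = number of samples (in our example: N = 41).

N is equal to loocv.get_n_splits(X).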
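The fold count can be checked by simply counting iterations. This is a small sketch using the same fake 41x257 shape as above; the names `n_folds` and `left_out` are only illustrative:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Fake data with the same shape as in the example above.
X = np.random.rand(41, 257)

n_folds = 0
left_out = []
for train_index, test_index in LeaveOneOut().split(X):
    n_folds += 1
    left_out.append(int(test_index[0]))  # the single held-out row of this fold

print(n_folds)                           # 41
print(sorted(left_out) == list(range(41)))  # True: every row is held out exactly once
```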
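One caveat: the example above uses NumPy arrays, where `X[train_index]` selects rows. In the question, X is built with `pd.concat`, so it is presumably a pandas DataFrame, and on a DataFrame `X[train_index]` selects columns by label instead. That would be consistent with the reported 42,41 shape, although this is an inference from the posted code rather than something confirmed. A minimal sketch on a hypothetical 5x8 frame with default integer column labels, using positional indexing via `.iloc`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut

# Hypothetical stand-in for the question's data: 5 samples (rows), 8 features (columns).
X = pd.DataFrame(np.random.rand(5, 8))
y = pd.Series([0, 1, 0, 1, 0], index=X.index)

train_index, test_index = next(LeaveOneOut().split(X))

by_label = X[train_index]          # picks 4 COLUMNS, keeps all 5 rows
X_train = X.iloc[train_index]      # picks 4 ROWS by position
X_test = X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

print(by_label.shape)   # (5, 4)
print(X_train.shape)    # (4, 8)
print(X_test.shape)     # (1, 8)
```

With `.iloc`, the number of training rows matches the number of training labels, which is what the classifier checks when it raises the "Number of labels ... does not match number of samples" error.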

Hope this helps.

python

scikit-learn

cross-validation