Train Test Split with same index

Question

I want rows that share the same index value to end up in the same set, either train or test, but never both. How can I do this in sklearn? For example:

import random
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6], 'B': random.sample(range(10, 100), 12)})
df.set_index('A', inplace=True)

I want to achieve:

a training set with index values 1, 3, 5, 6

a test set with index values 2, 4

How can I ensure this when using GridSearchCV?

Answer 1

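Treat the index values as groups. The split() method of sklearn's splitters accepts a groups argument, and the group-aware splitters use it to keep all rows of a group on the same side of each split, which is what you want.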

Example:

You can use GroupKFold or GroupShuffleSplit:

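from sklearn.model_selection import GroupKFold

# Use the index values as group labels so rows sharing an index stay together.
group_kfold = GroupKFold(n_splits=3)
for train_index, test_index in group_kfold.split(df, groups=df.index):
    print("Train", df.iloc[train_index].index)
    print("Test", df.iloc[test_index].index)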

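Output (the tuple formatting and u'A' show it was produced under Python 2; Python 3 prints the same index values in a different layout):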
('Train', Int64Index([1, 1, 1, 2, 2, 3, 4, 4], dtype='int64', name=u'A'))
('Test', Int64Index([5, 6, 6, 6], dtype='int64', name=u'A'))

('Train', Int64Index([2, 2, 4, 4, 5, 6, 6, 6], dtype='int64', name=u'A'))
('Test', Int64Index([1, 1, 1, 3], dtype='int64', name=u'A'))

('Train', Int64Index([1, 1, 1, 3, 5, 6, 6, 6], dtype='int64', name=u'A'))
('Test', Int64Index([2, 2, 4, 4], dtype='int64', name=u'A'))
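The output above comes from GroupKFold. If a single train/test split is enough, GroupShuffleSplit can be used in the same way. The following is only a minimal sketch: the B values are placeholders, and test_size=2 and random_state=0 are assumptions chosen for illustration (with an integer test_size, that many groups are held out). Which two groups land in the test set depends on the random state.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Same layout as the question's data; the B values are placeholders.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6],
                   'B': range(10, 22)})
df.set_index('A', inplace=True)

# One split that holds out 2 whole groups (an integer test_size counts groups, not rows).
gss = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df.index))

train, test = df.iloc[train_idx], df.iloc[test_idx]

# Collect the distinct index values on each side to check that none is shared.
train_groups = set(train.index)
test_groups = set(test.index)
print("Train groups:", sorted(train_groups))
print("Test groups:", sorted(test_groups))
```

Because the split is made on groups, train_groups and test_groups never overlap, whatever random_state is used.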

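You can see that the last of the three GroupKFold splits matches your requirement. In every fold, each index value appears in either the training data or the test data, but never in both.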
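For the GridSearchCV part of the question, one way that should work is to pass a group-aware splitter as cv and hand the group labels to fit(). This is a sketch under assumptions, not a definitive setup: the Ridge estimator, the alpha grid, and the random target y are all hypothetical stand-ins, since the question does not say what model or target is used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold

# Same layout as the question's data; the B values are placeholders.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6],
                   'B': range(10, 22)})
df.set_index('A', inplace=True)

X = df[['B']]
groups = df.index.to_numpy()          # index values double as group labels

# Hypothetical target, only here so the example can be fitted.
rng = np.random.default_rng(0)
y = rng.normal(size=len(df))

search = GridSearchCV(
    estimator=Ridge(),                         # assumed model
    param_grid={'alpha': [0.1, 1.0, 10.0]},    # assumed grid
    cv=GroupKFold(n_splits=3),                 # group-aware splitter
)

# The groups passed here are forwarded to the splitter's split() method.
search.fit(X, y, groups=groups)
print(search.best_params_)
```

Each candidate is then scored only on folds in which no index value is shared between the training and validation rows.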

python

python-3.x

pandas

scikit-learn

cross-validation