Stratify dataset while also avoiding contamination by Index?

As a reproducible example, I have the following dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])
df = df.set_index(['ID'])

df.head()
Out: 
     A   B   C   D
ID                
12   3  14   4   7
9    5   9   8   4
12  18  17   3  14
1    0  10   1   0
9   10   5  11   9

I need to perform a 70%-30% stratified split (on y), which I know would look like this:

# Train/Test Split
X = df.iloc[:,0:-1] # Columns A, B, and C
y = df.iloc[:,-1] # Column D
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, test_size=0.30, stratify=y)

However, while I want the train and test sets to have the same (or similar enough) distribution of 'D', I do not want the same 'ID' to be present in both the train and the test set.

How can I do this?
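
For what it's worth, scikit-learn's group-aware splitters look related: GroupShuffleSplit keeps all rows with the same ID on one side of the split but does not stratify on y, while StratifiedGroupKFold (scikit-learn >= 1.0) attempts both at once. A minimal sketch with GroupShuffleSplit, reusing X, y, and the indexed df from above:

from sklearn.model_selection import GroupShuffleSplit

# keeps every ID entirely in train or test, but does NOT stratify on y,
# so the distribution of D may drift between the two sets; note that
# train_size here is a fraction of the unique IDs (groups), not of rows
gss = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=df.index))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]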

EDIT: One way to do (something similar to) what you are asking could be to store the IDs by class, and then for each class take 70% of its IDs and put the samples with those IDs into the Train set, with the rest going into the Test set.

Note that this still does not guarantee identical distributions if each ID occurs a different number of times. Moreover, given that each ID can belong to multiple classes in D and should not be shared between the train and test sets, seeking identical distributions becomes a complex optimization problem: including an ID in either train or test drags along a variable number of classes, depending on the classes of all the rows in which that ID appears. For instance, in the sample above, ID 12 occurs once with D=7 and once with D=14, so whichever set it goes to receives one row of each of those classes.

A simpler way to split the data while approximating a balanced distribution is to iterate over the classes in random order and consider each ID for only one of the classes it appears with: all of that ID's rows are assigned to train/test under that class, and the ID is then removed from consideration for the remaining classes.

I found that treating the ID as a column helps with this task, so I modified the snippet you provided as follows:

# Given snippet (modified)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])

Proposed solution:

import random
from collections import defaultdict

classes = df.D.unique().tolist() # get unique classes
random.shuffle(classes)          # shuffle to eliminate positional biases
ids_by_class = defaultdict(list)


# iterate over classes
temp_df = df.copy()
for c in classes:
    c_rows = temp_df.loc[temp_df['D'] == c] # rows with given class
    ids = c_rows.ID.unique().tolist()       # IDs appearing in these rows
    ids_by_class[c].extend(ids)

    # remove ids so they cannot be taken into account for other classes
    temp_df = temp_df[~temp_df.ID.isin(ids)]
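
# optional sanity check: the loop above assigns every ID to exactly one
# class (an ID is dropped from temp_df as soon as it has been claimed)
all_ids = [i for ids in ids_by_class.values() for i in ids]
assert len(all_ids) == len(set(all_ids)) == df.ID.nunique()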


# now construct ids split, class by class
train_ids, test_ids = [], []
for c, ids in ids_by_class.items():
    random.shuffle(ids) # shuffling can eliminate positional biases

    # split the IDs
    split = int(len(ids)*0.7) # split at 70%

    train_ids.extend(ids[:split])
    test_ids.extend(ids[split:])

# finally use the ids in train and test to get the
# data split from the original df
train = df.loc[df['ID'].isin(train_ids)]
test = df.loc[df['ID'].isin(test_ids)]

Let's test that the split ratio roughly matches 70/30, that the data is preserved, and that no IDs are shared between the train and test dataframes:

# 1) check that elements in Train are roughly 70% and Test 30% of original df
print(f'Numbers of elements in train: {len(train)}, test: {len(test)}| Perfect split would be train: {int(len(df)*0.7)}, test: {int(len(df)*0.3)}')

# 2) check that concatenating Train and Test gives back the original df
train_test = pd.concat([train, test]).sort_values(by=['ID', 'A', 'B', 'C', 'D']) # concatenate dataframes into one, and sort
sorted_df = df.sort_values(by=['ID', 'A', 'B', 'C', 'D']) # sort original df
assert train_test.equals(sorted_df) # check equality

# 3) check that the IDs are not shared between train/test sets
train_id_set = set(train.ID.unique().tolist())
test_id_set = set(test.ID.unique().tolist())
assert len(train_id_set.intersection(test_id_set)) == 0

Example outputs from a few runs:

Numbers of elements in train: 209, test: 91| Perfect split would be train: 210, test: 90
Numbers of elements in train: 210, test: 90| Perfect split would be train: 210, test: 90
Numbers of elements in train: 227, test: 73| Perfect split would be train: 210, test: 90
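
The checks above cover the split ratio, data preservation, and ID disjointness, but not the distribution of D itself. A quick way to compare it (a sketch reusing the train and test frames from above; with few rows per class, the frequencies will only match approximately):

# 4) compare the relative frequency of each class of D in train vs. test
dist = pd.concat(
    [
        train.D.value_counts(normalize=True).rename('train'),
        test.D.value_counts(normalize=True).rename('test'),
    ],
    axis=1,
).fillna(0).sort_index()
print(dist)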