How can I split a dataframe using sklearn train test split such that there are equal proportions for each category?

Question

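I have a dataset containing n independent variables and one categorical variable, and I want to run a regression analysis on it. The number of rows differs between categories. I want to split the dataset into training and test sets so that each category gets the same train/test split, e.g. 80% to 20%. Below is a simplified, reproducible example of what I am doing.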

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

nrows=1000

# defining the category names
cat_values = ['A','B','C','D']
# assigning a random category to each row
cats = np.random.choice(cat_values, size=nrows)

# creating a random dataframe
df = pd.DataFrame(np.random.randint(0,1000,size=(nrows, 3)), columns=['variable 1','variable 2','variable 3'])
df['category'] = cats

y = np.random.rand(nrows)

# using sklearn to split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = .2, random_state =0)

# printing the number of rows in the output training data set for each category 
for cat in cat_values:
    print("number of rows in category " + cat + ": " + str(len(X_train[X_train['category'] == cat])))

Output:

number of rows in category A: 221
number of rows in category B: 188
number of rows in category C: 179
number of rows in category D: 212

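I want the rows to be split, e.g. 80:20 train:test, within each value of the categorical variable. I have looked at StratifiedShuffleSplit (Train/test split preserving class proportions in each split), but there doesn't seem to be an option for specifying which column to stratify the split on (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html).

Is there a package that can split the data this way, or do I have to split my dataframe into n per-category dataframes and perform a separate train/test split on each one before joining them back together?

Any help with this is appreciated.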

Answer 1

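Use train_test_split with the stratify parameter. stratify expects class labels, so pass the category column rather than the continuous y from the question (stratifying on y would fail, because nearly every value of y is unique):

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=.2, random_state=0, stratify=df['category']
)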
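To check that the split behaves as intended, here is a minimal sketch that rebuilds the question's random data and compares per-category row counts before and after the split. It assumes stratification on the category column via stratify=df['category']; the seeded generator rng and the counts table are illustrative additions, not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded generator so the example is reproducible (an addition for illustration).
rng = np.random.default_rng(0)
nrows = 1000
cat_values = ['A', 'B', 'C', 'D']

# Same shape of data as in the question: three numeric columns plus a category.
df = pd.DataFrame(rng.integers(0, 1000, size=(nrows, 3)),
                  columns=['variable 1', 'variable 2', 'variable 3'])
df['category'] = rng.choice(cat_values, size=nrows)
y = rng.random(nrows)  # continuous regression target

# Stratify on the categorical column, not on the continuous target.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0, stratify=df['category']
)

# Per-category row counts in the full data, the training set, and the test set.
counts = pd.DataFrame({
    'total': df['category'].value_counts(),
    'train': X_train['category'].value_counts(),
    'test': X_test['category'].value_counts(),
})
counts['train_share'] = counts['train'] / counts['total']
print(counts)
```

Each category's train_share should come out close to 0.8; small deviations are expected because row counts have to be rounded to whole numbers.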
