How to split data into train and test while stratifying on labels and preventing the same entity from appearing in both?

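I have a dataset like the one shown below. I want to split it into train and test sets that are stratified on the label. At the same time, I don't want the same player to appear in both sets.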

For example, when I split with train:test = 1:1:

player utterances label
Bob ... 1
John ... 1
Mary ... 0
Kethy ... 1
Jack ... 1
John ... 0
John ... 1
Mary ... 1

train (label 0 : label 1 = 1 : 3)

player utterances label
Bob ... 1
John ... 1
John ... 0
John ... 1

test (label 0 : label 1 = 1 : 3)

player utterances label
Mary ... 0
Mary ... 1
Kethy ... 1
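Jack ... 1

import pandas as pd
from sklearn.model_selection import train_test_split

# df is assumed to be the DataFrame shown above (columns: player, utterances, label).
# Split it into one sub-DataFrame per player so a player can never be divided.
grouped = df.groupby('player')
groups = [grouped.get_group(x) for x in grouped.groups]

# Split the list of player groups, retrying until both halves hold the same number of rows.
# Note: this never terminates if no such split exists, and it does not stratify on label.
train, test = train_test_split(groups, test_size=0.5)
while len(pd.concat(train)) != len(pd.concat(test)):
    train, test = train_test_split(groups, test_size=0.5)

train = pd.concat(train)
test = pd.concat(test)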

Inspired by tako0707's answer, I split my data into train, valid, and test as follows.

Fortunately, the labels in train, valid, and test turned out to be almost stratified.

import pandas as pd

utterances, labels, players = [...], [...], [...]
df = pd.DataFrame(
   dict(
     utterances=utterances,
     labels=labels,
     players=players,
   )
)

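# One sub-DataFrame per player, so a player can only land in one split.
# The column is named 'players' in the DataFrame built above.
grouped = df.groupby('players')
groups = [grouped.get_group(x) for x in grouped.groups]

# Fill train with whole player groups until it holds about 80% of the rows.
i = 0
train, train_size = [groups[i]], len(groups[i])
while train_size < len(labels) * 0.8:
    i += 1
    train_size += len(groups[i])
    train.append(groups[i])

# Move to the next unused group first; reusing groups[i] would put a
# train player into test. Assumes enough player groups remain.
i += 1
test, test_size = [groups[i]], len(groups[i])
while test_size < len(labels) * 0.1 and i + 1 < len(groups):
    i += 1
    test_size += len(groups[i])
    test.append(groups[i])

# Same for valid: start from the next unused group.
i += 1
valid, valid_size = [groups[i]], len(groups[i])
while valid_size < len(labels) * 0.1 and i + 1 < len(groups):
    i += 1
    valid_size += len(groups[i])
    valid.append(groups[i])

# Any leftover groups go to train.
train.extend(groups[i+1:])

train, valid, test = pd.concat(train), pd.concat(valid), pd.concat(test)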
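
Since a manual split like this only ends up stratified by luck, it may be worth checking the result rather than trusting it. Below is a small sketch of such a check; `check_split` is a hypothetical helper (not part of pandas or scikit-learn), and the column names `players` and `labels` are assumed to match the DataFrame above. It is demonstrated on the train/test tables from the question.

```python
import pandas as pd

def check_split(splits, group_col="players", label_col="labels"):
    """Hypothetical helper: report shared groups and label proportions.

    splits maps a split name (e.g. "train") to its DataFrame.
    Returns (leaks, proportions): leaks maps a pair of split names to the
    groups they share; proportions has one column per split and one row
    per label value.
    """
    names = list(splits)
    leaks = {}
    for pos, a in enumerate(names):
        for b in names[pos + 1:]:
            shared = set(splits[a][group_col]) & set(splits[b][group_col])
            if shared:
                leaks[(a, b)] = shared
    proportions = pd.DataFrame({
        name: part[label_col].value_counts(normalize=True)
        for name, part in splits.items()
    }).fillna(0.0)
    return leaks, proportions

# The desired train/test split from the question, used as toy input.
train = pd.DataFrame({"players": ["Bob", "John", "John", "John"],
                      "labels": [1, 1, 0, 1]})
test = pd.DataFrame({"players": ["Mary", "Mary", "Kethy", "Jack"],
                     "labels": [0, 1, 1, 1]})

leaks, proportions = check_split({"train": train, "test": test})
print(leaks)        # empty dict when no player appears in two splits
print(proportions)  # label share per split
```

The same call works for three splits by passing `{"train": train, "valid": valid, "test": test}`.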