How to split data into train and test sets while stratifying on labels and preventing the same entity from appearing in both?
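I have a dataset like the one shown below. I want to split it into train and test sets, stratified on the label. At the same time, I don't want the same player to appear in both sets.
For example, when I split with train:test = 1:1: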
player | utterances | label |
---|---|---|
Bob | ... | 1 |
John | ... | 1 |
Mary | ... | 0 |
Kethy | ... | 1 |
Jack | ... | 1 |
John | ... | 0 |
John | ... | 1 |
Mary | ... | 1 |
→
Train (label 0 : label 1 = 1 : 3)
player | utterances | label |
---|---|---|
Bob | ... | 1 |
John | ... | 1 |
John | ... | 0 |
John | ... | 1 |
→
Test (label 0 : label 1 = 1 : 3)
player | utterances | label |
---|---|---|
Mary | ... | 0 |
Mary | ... | 1 |
Kethy | ... | 1 |
Jack | ... | 1 |
import pandas as pd
from sklearn.model_selection import train_test_split

# df is the DataFrame shown above, with columns player, utterances, label
grouped = df.groupby('player')
# One sub-DataFrame per player, so a player can never end up in both sets
groups = [grouped.get_group(x) for x in grouped.groups]

train, test = train_test_split(groups, test_size=0.5)
# Re-draw until both sets contain the same number of rows.
# Note: this never terminates if no equal-sized split of the groups exists,
# and it does not stratify on the label.
while len(pd.concat(train)) != len(pd.concat(test)):
    train, test = train_test_split(groups, test_size=0.5)

train = pd.concat(train)
test = pd.concat(test)
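Inspired by tako0707's answer, I split my data into train, valid, and test sets as follows.
Fortunately, the labels in train, valid, and test turned out to be almost stratified.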
import pandas as pd

utterances, labels, players = [...], [...], [...]
df = pd.DataFrame(
    dict(
        utterances=utterances,
        labels=labels,
        players=players,
    )
)

# One sub-DataFrame per player; the column is named 'players'
grouped = df.groupby('players')
groups = [grouped.get_group(x) for x in grouped.groups]

# Walk through the players once and give each one to exactly one split:
# train until it holds about 80% of the rows, then test until about 10%,
# and whatever is left goes to valid.
train, test, valid = [], [], []
train_size, test_size = 0, 0
for group in groups:
    if train_size < len(df) * 0.8:
        train.append(group)
        train_size += len(group)
    elif test_size < len(df) * 0.1:
        test.append(group)
        test_size += len(group)
    else:
        valid.append(group)

# pd.concat raises ValueError on an empty list, so each split needs at least one player
train, valid, test = pd.concat(train), pd.concat(valid), pd.concat(test)
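Because the label balance here comes from luck rather than construction, it may be worth checking each split after the fact. The sketch below uses a hypothetical helper, `summarize_split` (not part of pandas or scikit-learn), that raises if any player appears in more than one split and returns the label proportions per split. The column names `players` and `labels` are assumed to match the DataFrame above, and the toy frames simply reuse the rows from the question's example tables.

```python
import pandas as pd


def summarize_split(splits, group_col="players", label_col="labels"):
    """Check that no group is shared between splits and report label proportions.

    `splits` maps a split name (e.g. "train") to its DataFrame.
    """
    names = list(splits)
    # Every pair of splits must have disjoint sets of players.
    for a_pos, a in enumerate(names):
        for b in names[a_pos + 1:]:
            shared = set(splits[a][group_col]) & set(splits[b][group_col])
            if shared:
                raise ValueError(f"{a} and {b} share groups: {sorted(shared)}")
    # One row per split, one column per label value; each row sums to 1.
    return pd.DataFrame(
        {name: part[label_col].value_counts(normalize=True) for name, part in splits.items()}
    ).T.fillna(0.0)


# Toy data standing in for the real train/test frames.
toy_train = pd.DataFrame({"players": ["Bob", "John", "John", "John"], "labels": [1, 1, 0, 1]})
toy_test = pd.DataFrame({"players": ["Mary", "Mary", "Kethy", "Jack"], "labels": [0, 1, 1, 1]})

report = summarize_split({"train": toy_train, "test": toy_test})
print(report)
```

On the real data the call would look like `summarize_split({"train": train, "valid": valid, "test": test})`; rows with similar proportions indicate the splits are close to stratified.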