Train test split based on a column's values - sequentially
I have a dataframe as below:
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})
  Col1   Col2   Col3  y
0    A  -2.21 -0.270 -1
1    B  -9.59 -0.570  1
2    B   0.16  0.072 -1
3    A   1.29 -0.150 -1
4    B -31.92 -0.210 -1
5    B -24.48 -2.540  1
6    A  15.23 -1.060  1
7    B  34.58  1.940  1
8    A  24.33  1.830  1
9    A  -3.32  0.720 -1
Is there a way to split the dataframe 60:40 so that, within each group of Col1, the first 60% of the rows go to train and the last 40% go to test?
Train:
  Col1   Col2   Col3  y
0    A  -2.21 -0.270 -1
1    B  -9.59 -0.570  1
2    B   0.16  0.072 -1
3    A   1.29 -0.150 -1
4    B -31.92 -0.210 -1
6    A  15.23 -1.060  1
Test:
  Col1   Col2   Col3  y
5    B -24.48 -2.540  1
7    B  34.58  1.940  1
8    A  24.33  1.830  1
9    A  -3.32  0.720 -1
If you need to split without groups:
thresh = int(len(df) * 0.6)
train = df.iloc[:thresh]
test = df.iloc[thresh:]
print(train)
  Col1   Col2   Col3  y
0    A  -2.21 -0.270 -1
1    B  -9.59 -0.570  1
2    B   0.16  0.072 -1
3    A   1.29 -0.150 -1
4    B -31.92 -0.210 -1
5    B -24.48 -2.540  1
print(test)
  Col1   Col2  Col3  y
6    A  15.23 -1.06  1
7    B  34.58  1.94  1
8    A  24.33  1.83  1
9    A  -3.32  0.72 -1
EDIT: If you need to split per group, use GroupBy.cumcount and filter by a threshold:
thresh = int(len(df) * 0.6 / df['Col1'].nunique())
print(thresh)
3
mask = df.groupby('Col1')['Col1'].cumcount() < thresh
train = df[mask]
test = df[~mask]
print(train)
  Col1   Col2   Col3  y
0    A  -2.21 -0.270 -1
1    B  -9.59 -0.570  1
2    B   0.16  0.072 -1
3    A   1.29 -0.150 -1
4    B -31.92 -0.210 -1
6    A  15.23 -1.060  1
print(test)
  Col1   Col2  Col3  y
5    B -24.48 -2.54  1
7    B  34.58  1.94  1
8    A  24.33  1.83  1
9    A  -3.32  0.72 -1
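The threshold above spreads the overall training size evenly over the groups, which fits this example because A and B both have five rows. For groups of different sizes, a minimal sketch (assuming the goal is still the first 60% of each group, rounded down) computes one threshold per group from the group size. The frame `df_unbalanced` and the names `pos` and `group_thresh` are made up for illustration.

```python
import pandas as pd

# Hypothetical unbalanced frame: 7 rows of A, 3 rows of B
df_unbalanced = pd.DataFrame({
    "Col1": ["B", "A", "A", "B", "A", "A", "A", "B", "A", "A"],
    "Col2": list(range(10)),
})

grouped = df_unbalanced.groupby("Col1")["Col1"]

# 0-based position of each row within its own group
pos = grouped.cumcount()

# Training rows allowed per group: 60% of the group size, truncated
# (A: int(7 * 0.6) = 4, B: int(3 * 0.6) = 1)
group_thresh = (grouped.transform("count") * 0.6).astype(int)

mask = pos < group_thresh
train = df_unbalanced[mask]
test = df_unbalanced[~mask]

# For comparison, the single global threshold would be
# int(10 * 0.6 / 2) = 3, which puts all three B rows into train.
print(train.index.tolist())
print(test.index.tolist())
```

With a per-group threshold, each group contributes to both parts as long as it has at least two rows.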
IIUC, you can use numpy.split:
import numpy as np
train, test = np.split(df, [int(len(df) * 0.6)])
print(train)
  Col1   Col2   Col3  y
0    A  -2.21 -0.270 -1
1    B  -9.59 -0.570  1
2    B   0.16  0.072 -1
3    A   1.29 -0.150 -1
4    B -31.92 -0.210 -1
5    B -24.48 -2.540  1
print(test)
  Col1   Col2  Col3  y
6    A  15.23 -1.06  1
7    B  34.58  1.94  1
8    A  24.33  1.83  1
9    A  -3.32  0.72 -1
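A variant of the same idea, sketched here as an assumption rather than as part of the answer: split an array of row positions with np.split and select the rows with iloc, so the result does not depend on how np.split handles a DataFrame. The names `train_pos` and `test_pos` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})

# Split the row positions 0..len(df)-1 at the 60% mark
train_pos, test_pos = np.split(np.arange(len(df)), [int(len(df) * 0.6)])

# Select rows by position; the original index labels are kept
train = df.iloc[train_pos]
test = df.iloc[test_pos]

print(train.index.tolist())
print(test.index.tolist())
```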
I think you need groupby here:
s = df.groupby('Col1').Col1.cumcount()  # position of each row within its group
s = s // (df.groupby('Col1').Col1.transform('count') * 0.6).astype(int)  # 0 for the first 60% of each group
Train = df.loc[s == 0].copy()
Test = df.drop(Train.index)
Train
Out[118]:
  Col1   Col2   Col3  y
0    A  -2.21 -0.270 -1
1    B  -9.59 -0.570  1
2    B   0.16  0.072 -1
3    A   1.29 -0.150 -1
4    B -31.92 -0.210 -1
6    A  15.23 -1.060  1
Test
Out[119]:
  Col1   Col2  Col3  y
5    B -24.48 -2.54  1
7    B  34.58  1.94  1
8    A  24.33  1.83  1
9    A  -3.32  0.72 -1
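To confirm that a split like this is sequential per group, a small check can help. The sketch below repeats the split and then verifies two things: how many rows each group contributes to each part, and that every training row of a group comes before that group's first test row. It assumes the default integer index reflects row order and that every group appears in both parts; `counts`, `last_train`, `first_test` and `is_sequential` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A','A'],
                   "Col2": [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
                   "Col3": [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
                   "y": [-1,1,-1,-1,-1,1,1,1,1,-1]})

# Same split as above: block 0 holds the first 60% of each group
pos = df.groupby("Col1")["Col1"].cumcount()
size = df.groupby("Col1")["Col1"].transform("count")
block = pos // (size * 0.6).astype(int)
train = df.loc[block == 0].copy()
test = df.drop(train.index)

# Rows per group in each part
counts = pd.DataFrame({
    "train": train["Col1"].value_counts(),
    "test": test["Col1"].value_counts(),
}).fillna(0).astype(int)

# Within each group, the last training row must precede the first test row
last_train = train.reset_index().groupby("Col1")["index"].max()
first_test = test.reset_index().groupby("Col1")["index"].min()
is_sequential = bool((last_train < first_test).all())

print(counts)
print(is_sequential)
```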