Split into a training set and a test set by specific row attribute values

Question

My input file has the following format:

gold,Attribute1,Attribute2
T,1,1
T,1,2
T,1,1
N,1,2
N,2,1
T,2,1
T,2,2
N,2,2
T,3,1
N,3,2
N,3,1
T,3,2
N,3,3
N,3,3

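I am trying to predict the first column from the second and third columns. I want to randomly split this input data into a training set and a test set so that all rows sharing a particular combination of values fall entirely in the training set or entirely in the test set. For example, all rows with the values <1,1>, <1,2>, <2,1> should belong to the training set, and all rows with the values <2,2>, <3,1>, <3,2>, <3,3> should belong to the test set. The assignment must be random; the above is only an example. How can I make such a split?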

Answer 1

A simple way to split is to filter on a condition rather than use a predefined splitting method.

Code:

import pandas as pd

# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv('test.csv')

print(df.head())
print(df.describe())
print(type(df['Attribute1']))

# Training set: rows where both attributes are less than or equal to 2
df_Condition1 = df[df['Attribute1'] <= 2]
Train_Set = df_Condition1[df_Condition1['Attribute2'] <= 2]

# Test set: blank out every cell that also appears in Train_Set (matched by
# index and column label), then drop the rows that became NaN.
# Side effect: the NaN step casts the numeric columns to float.
Test_Set = df[~df.isin(Train_Set)]
Test_Set = Test_Set.dropna()

print(Train_Set)
print(Test_Set)

Output:

     gold  Attribute1  Attribute2
   0    T           1           1
   1    T           1           2
   2    T           1           1
   3    N           1           2
   4    N           2           1

          Attribute1  Attribute2
   count   14.000000   14.000000
   mean     2.142857    1.714286
   std      0.864438    0.726273
   min      1.000000    1.000000
   25%      1.250000    1.000000
   50%      2.000000    2.000000
   75%      3.000000    2.000000
   max      3.000000    3.000000
   <class 'pandas.core.series.Series'>

     gold  Attribute1  Attribute2
   0    T           1           1
   1    T           1           2
   2    T           1           1
   3    N           1           2
   4    N           2           1
   5    T           2           1
   6    T           2           2
   7    N           2           2

      gold  Attribute1  Attribute2
   8     T         3.0         1.0
   9     N         3.0         2.0
   10    N         3.0         1.0
   11    T         3.0         2.0
   12    N         3.0         3.0
   13    N         3.0         3.0
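
The condition above is fixed, while the question asks for the combinations to be assigned at random. One possible way to do that is sketched below: collect the distinct (Attribute1, Attribute2) pairs, shuffle them, and send a share of the pairs (with all of their rows) to the test set. The helper name split_by_combination, its test_fraction and seed parameters, and the inlined CSV_TEXT are assumptions made for illustration; they do not come from the original post.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, inlined so the sketch runs without test.csv.
CSV_TEXT = """gold,Attribute1,Attribute2
T,1,1
T,1,2
T,1,1
N,1,2
N,2,1
T,2,1
T,2,2
N,2,2
T,3,1
N,3,2
N,3,1
T,3,2
N,3,3
N,3,3
"""


def split_by_combination(df, cols, test_fraction=0.5, seed=None):
    """Hypothetical helper: randomly assign each distinct combination
    of `cols` (with all of its rows) to either train or test."""
    rng = np.random.default_rng(seed)
    # One row per distinct combination of the grouping columns.
    combos = df[cols].drop_duplicates().reset_index(drop=True)
    # Shuffle the combinations and take the first chunk as test combinations.
    order = rng.permutation(len(combos))
    n_test = int(round(len(combos) * test_fraction))
    test_combos = combos.iloc[order[:n_test]]
    # A left merge keeps the row order of df; the indicator column marks
    # rows whose combination was selected for the test set.
    marked = df.merge(test_combos, on=cols, how="left", indicator=True)
    is_test = (marked["_merge"] == "both").to_numpy()
    return df[~is_test], df[is_test]


df = pd.read_csv(io.StringIO(CSV_TEXT))
train, test = split_by_combination(
    df, ["Attribute1", "Attribute2"], test_fraction=0.5, seed=0
)
print(train)
print(test)
```

Because whole combinations are moved, the row counts of the two sets depend on how many rows each selected combination has, so test_fraction controls the share of combinations, not the share of rows. If scikit-learn is available, its GroupShuffleSplit class covers a similar use case when the combination is passed as the group label.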

python

split

machine-learning

training-data