Split a dataframe into two when one part is already known

Question

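I have a dataframe with a column called "label" that holds a binary feature [0,1]. The dataframe is imbalanced: label 0 occurs more often than label 1. To build a good estimator, I want to split the data into training and test subsets, where the training subset must be well balanced. I could try a resampling algorithm such as SMOTE or similar; however, I decided on the following strategy: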

Select all rows of dataframe with label 1 and randomly pick 80% of them, for example:

train_class1=dataframe[dataframe["label"]==1].iloc[np.random.randint(0, len(dataframe[dataframe["label"]==1]), len(dataframe[dataframe["label"]==1])*80//100)]

Then, from the rows with label 0, I take a random subselection of the same size as train_class1 and call it train_class0, for example:

train_class0=dataframe[dataframe["label"]==0].iloc[np.random.randint(0, len(dataframe[dataframe["label"]==0]), len(dataframe[dataframe["label"]==1])*80//100)]

I then plan to concatenate the two dataframes row-wise to form my training subsample:

train_class=pd.concat([train_class1,train_class0])

Now, as the test subsample, I want the rest of the initial dataframe, that is, all rows of dataframe that do not belong to train_class. I tried the following:

test_class =pd.concat([dataframe, train_class]).drop_duplicates()

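That is, concatenating the initial dataframe with train_class and dropping the duplicated rows.

Although this looks fine (at least to me at this point), when I check the shapes of dataframe, train_class and test_class, I get: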

dataframe.shape=(257673, 208)

train_class.shape=(263476, 208)

test_class.shape=(257673, 208)

This is clearly contradictory.

What am I doing wrong in my code?

Answer 1

I actually solved the problem myself...

In the definitions of train_class1 and train_class0, I changed the code to:

# Size of each training class: 80% of the label-1 rows, as described in the question
n_train=len(dataframe[dataframe["label"]==1])*80//100
train_class1=dataframe[dataframe["label"]==1].sample(n_train)
train_class0=dataframe[dataframe["label"]==0].sample(n_train)

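This uses the built-in pandas method df.sample(), which draws rows without replacement by default. np.random.randint, by contrast, can return the same position more than once, so the original training subsets could contain repeated rows.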
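
To illustrate the difference between the two ways of drawing rows, here is a minimal sketch on toy data. The dataframe `ones` and its column `x` are hypothetical stand-ins for the label-1 rows of the real data; only `np.random.randint`, `.iloc` and `.sample()` come from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the label-1 rows of the real dataframe.
ones = pd.DataFrame({"x": range(10), "label": [1] * 10})

# Question's approach: random positions drawn with replacement, so repeats are possible.
np.random.seed(0)
positions = np.random.randint(0, len(ones), 8)
with_replacement = ones.iloc[positions]

# Answer's approach: sample() draws without replacement by default.
without_replacement = ones.sample(8, random_state=0)

# Number of rows drawn vs. number of distinct rows among them.
print(len(with_replacement), with_replacement.index.nunique())
print(len(without_replacement), without_replacement.index.nunique())  # 8 8
```

With `sample()` the two numbers always match; with random positions the second number can be smaller than the first, which means some rows were picked more than once.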

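For the test subset, one possible approach (not part of the fix above, and assuming the index of dataframe is unique) is to drop the training rows by index label. Note that pd.concat([dataframe, train_class]).drop_duplicates() keeps the first copy of every repeated row, so when dataframe has no repeated rows of its own it returns dataframe again rather than the complement. The toy data and the column `x` below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 20 rows with label 0 and 10 rows with label 1.
dataframe = pd.DataFrame({"x": range(30), "label": [0] * 20 + [1] * 10})

# 80% of the label-1 rows, used as the size of both training classes.
n_train = int((dataframe["label"] == 1).sum()) * 80 // 100
train_class1 = dataframe[dataframe["label"] == 1].sample(n_train, random_state=0)
train_class0 = dataframe[dataframe["label"] == 0].sample(n_train, random_state=0)
train_class = pd.concat([train_class1, train_class0])

# Test subset: every row whose index label is not in the training subset.
# This relies on dataframe having a unique index.
test_class = dataframe.drop(train_class.index)

# The attempt from the question keeps the first copy of each row,
# so here it reproduces the original dataframe instead of the complement.
attempt = pd.concat([dataframe, train_class]).drop_duplicates()

print(train_class.shape, test_class.shape)  # (16, 2) (14, 2)
print(attempt.shape)                        # (30, 2)
```

Here the training and test subsets are disjoint and together cover every row of dataframe, so their row counts add up to the original 30.
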
Tags: python, coding-style, dataframe, pandas