SMOTE python

Question

I am trying to use SMOTE in Python and would like to know whether there is a way to manually specify the number of minority-class samples.

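Say we have 100 records of one class and 10 records of another. If we use ratio = 1 we get 100:100, and if we use ratio = 1/2 we get 100:200. What I am looking for is a way to manually specify the number of instances to generate for each class.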

    Ndf_class_0_records = trainData[trainData['DIED'] == 0]
    Ndf_class_1_records = trainData[trainData['DIED'] == 1]
    Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
    Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
    X_smote = trainData.drop("DIED", axis=1)
    y_smote = trainData["DIED"]
    smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
    X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)

In the code above I am trying to specify the count for each class manually, but the last line raises the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Answer 1

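If I understand you and the documentation correctly, you are not passing numbers as the ratio. You are passing pandas Series objects, because value_counts() returns a Series and not an integer.

The accepted types for ratio are: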

float, str, dict or callable, (default=’auto’)
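To see the difference, here is a small sketch on made-up data (the DataFrame contents and the names count_as_series and count_as_int are invented for illustration; only the DIED column comes from the question). It shows that value_counts() gives back a Series while len() gives a plain integer, and that putting the Series in a boolean context raises the same kind of error:

```python
import pandas as pd

# Made-up data for illustration: 100 rows of class 0, 10 rows of class 1
trainData = pd.DataFrame({"DIED": [0] * 100 + [1] * 10})
class_1 = trainData[trainData["DIED"] == 1]

count_as_series = class_1.DIED.value_counts()  # a pandas Series with one entry
count_as_int = len(class_1)                    # a plain Python int

print(type(count_as_series).__name__)  # Series
print(count_as_int * 2)                # 20

# Using the Series where a single True/False is expected raises ValueError
try:
    if count_as_series * 2 > 0:
        pass
except ValueError as err:
    print("ValueError:", err)
```

Any comparison that a library performs internally on the dictionary values will hit the same problem, which is presumably what happens inside the resampling call.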

Please try this:

    from imblearn.over_sampling import SMOTE

    Ndf_class_0_records = trainData[trainData['DIED'] == 0]
    Ndf_class_1_records = trainData[trainData['DIED'] == 1]
    # len() returns a plain int, whereas value_counts() returned a Series
    Ndf_class_0_record_counts = len(Ndf_class_0_records)  ##### CHANGED THIS
    Ndf_class_1_record_counts = len(Ndf_class_1_records)  ##### CHANGED THIS
    X_smote = trainData.drop("DIED", axis=1)
    y_smote = trainData["DIED"]
    # Keep class 0 at its current size and double the size of class 1
    smt = SMOTE(ratio={0: Ndf_class_0_record_counts, 1: Ndf_class_1_record_counts * 2})
    X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)

It should work now, give it a try!
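If you want to set the targets for several classes at once, one possible approach is a small helper that builds the dictionary from the labels using only plain integers. This is just a sketch: build_targets and its multipliers argument are hypothetical names of mine, not part of imbalanced-learn.

```python
from collections import Counter


def build_targets(labels, multipliers):
    """Return {class_label: target_count} using plain Python ints.

    `multipliers` maps a class label to the factor by which its current
    count should grow; classes not listed keep their current count.
    (Hypothetical helper, not an imbalanced-learn function.)
    """
    counts = Counter(labels)
    return {
        label: int(round(n * multipliers.get(label, 1)))
        for label, n in counts.items()
    }


# Same class sizes as in the question: 100 of class 0, 10 of class 1
labels = [0] * 100 + [1] * 10
targets = build_targets(labels, {1: 2})
print(targets)  # {0: 100, 1: 20}

# With imbalanced-learn installed, the dict can be passed to SMOTE.
# Recent releases name the parameter `sampling_strategy` and the method
# `fit_resample`; older releases used `ratio` and `fit_sample`.
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(sampling_strategy=targets).fit_resample(X, y)
```

Since SMOTE only over-samples, each target count in the dictionary should be at least as large as the class's current count.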

python

machine-learning

dataframe

oversampling