Imbalanced-Learn's FunctionSampler throws ValueError

Question

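I want to use the FunctionSampler class from imblearn to create my own custom class for resampling my dataset.

I have a one-dimensional feature Series containing the path for each subject and a label Series containing the label for each subject. Both come from a pd.DataFrame. I know that I have to reshape the feature array first, since it is one-dimensional.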

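Everything works fine when I use the RandomUnderSampler class directly. However, if I pass the features and labels to the fit_resample method of FunctionSampler, which then creates an instance of RandomUnderSampler and calls fit_resample on it, I get the following error: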

ValueError: could not convert string to float: 'path_1'

Here is a minimal example that produces the error:

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from imblearn import FunctionSampler

# create one dimensional feature and label arrays X and y
# X has to be converted to numpy array and then reshaped. 
X = pd.Series(['path_1','path_2','path_3'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

First approach (works)

rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X,y)

Second approach (does not work)

def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)

sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)

Does anyone know what is going wrong here? It seems that the fit_resample method of FunctionSampler is not equivalent to the fit_resample method of RandomUnderSampler...

Answer 1

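Your use of FunctionSampler is correct. The problem lies in your dataset.

RandomUnderSampler also seems to work with text data: X is apparently not validated with check_X_y there.

FunctionSampler(), however, does run this check (see here).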

import pandas as pd
from sklearn.utils import check_X_y

X = pd.Series(['path_1','path_2','path_2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

check_X_y(X, y)

This raises the error:

ValueError: could not convert string to float: 'path_1'
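To confirm that the numeric conversion is what fails, rather than the shape of the data, the same check can be repeated with dtype=None, which tells check_X_y to leave the input dtype alone. This is only a sketch of the validation behaviour in scikit-learn. It does not claim that imblearn's samplers call check_X_y in exactly this way internally.

```python
import pandas as pd
from sklearn.utils import check_X_y

X = pd.Series(['path_1', 'path_2', 'path_3']).values.reshape(-1, 1)
y = pd.Series([1, 0, 0])

# Default dtype="numeric": the object array is converted to float,
# which fails on the first string.
try:
    check_X_y(X, y)
    numeric_check_passed = True
except ValueError as err:
    numeric_check_passed = False
    print(err)

# dtype=None preserves the input dtype, so the strings are left untouched.
X_checked, y_checked = check_X_y(X, y, dtype=None)
print(X_checked.dtype, X_checked.shape)
```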

The following example works, because the strings can be converted to numbers:

X = pd.Series(['1','2','2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)

sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)

X_res, y_res 
# (array([[2.],
#        [1.]]), array([0, 1], dtype=int64))


