Using SMOTE-NC with categorical variables only

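I am working with a data frame that contains only categorical features. To reproduce the problem I am facing, I will use the following example: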

import pandas as pd

d = {'col1': ['a', 'b', 'c', 'a', 'c', 'c', 'c', 'c', 'c', 'c'],
     'col2': ['a1', 'b1', 'c1', 'a1', 'c1', 'c1', 'c1', 'c1', 'c1', 'c1'],
     'col3': [1, 2, 3, 2, 3, 3, 3, 3, 3, 3]}
data = pd.DataFrame(d)

I split the data into training and test sets, with col3 as my target feature.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.2)
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

X_train = train_data.drop(['col3'], axis=1)
X_test = test_data.drop(['col3'], axis=1)
y_train = train_data["col3"]
y_test = test_data["col3"]

In X_train, col1 and col2 are my categorical features, at column indices 0 and 1, so I set up SMOTE-NC as:

from imblearn.over_sampling import SMOTENC
cat_indx = [0, 1]
sm = SMOTENC(categorical_features=cat_indx, random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

This raises the following error:

ValueError: SMOTE-NC is not designed to work only with categorical features. It requires some numerical features.

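Given that SMOTE-NC is designed to handle categorical variables, how can I solve this? Note also that my target variable is multiclass rather than binary, which I don't think causes any problem at this stage.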

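Note that the NC in the algorithm's name stands for Nominal-Continuous; as the error message clearly states, the algorithm is not designed to work with categorical (nominal) features only.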

To see why this is so, you have to dig a little into the original SMOTE paper; quoting from the relevant section (emphasis mine):

While our SMOTE approach currently does not handle data sets with all nominal features, it was generalized to handle mixed datasets of continuous and nominal features. We call this approach Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult dataset from the UCI repository. The SMOTE-NC algorithm is described below.

  1. Median computation: Compute the median of standard deviations of all continuous features for the minority class. If the nominal features differ between a sample and its potential nearest neighbors, then this median is included in the Euclidean distance computation. We use median to penalize the difference of nominal features by an amount that is related to the typical difference in continuous feature values.
  2. Nearest neighbor computation: Compute the Euclidean distance between the feature vector for which k-nearest neighbors are being identified (minority class sample) and the other feature vectors (minority class samples) using the continuous feature space. For every differing nominal feature between the considered feature vector and its potential nearest-neighbor, include the median of the standard deviations previously computed, in the Euclidean distance computation.
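The two quoted steps can be sketched in plain Python. This is only an illustration of the description above, not imbalanced-learn's implementation: the function names are hypothetical, the population standard deviation is an assumption (the quoted text does not say which estimator is used), and each differing nominal feature is assumed to contribute the squared median to the sum under the square root.

```python
import math
import statistics


def median_of_stds(continuous_rows):
    """Step 1: median of the per-feature standard deviations.

    `continuous_rows` holds only the continuous values of the
    minority-class samples, one list per sample.
    """
    if not continuous_rows or not continuous_rows[0]:
        # With no continuous columns there is nothing to take a median of.
        raise ValueError("at least one continuous feature is required")
    columns = list(zip(*continuous_rows))
    # Assumption: population standard deviation (pstdev).
    return statistics.median(statistics.pstdev(col) for col in columns)


def smote_nc_distance(cont_a, nom_a, cont_b, nom_b, med):
    """Step 2: Euclidean distance on the continuous part, plus a penalty
    of med**2 for every nominal feature on which the samples differ."""
    squared = sum((x - y) ** 2 for x, y in zip(cont_a, cont_b))
    n_differing = sum(1 for x, y in zip(nom_a, nom_b) if x != y)
    return math.sqrt(squared + n_differing * med ** 2)


# Made-up minority-class samples: two continuous and two nominal features.
minority_continuous = [[1.0, 2.0], [3.0, 2.0]]
med = median_of_stds(minority_continuous)
dist = smote_nc_distance([1.0, 2.0], ["a", "a1"], [3.0, 2.0], ["a", "b1"], med)
```

In this sketch the penalty for a nominal mismatch is derived entirely from the continuous columns, which is why the method has nothing to work with when every column is nominal.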

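So it is clear that, for the algorithm to work, it needs at least one continuous feature. That is not the case here, so the algorithm quite unsurprisingly fails during step 1 (median computation), since there are no continuous features from which to compute the median.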
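As a rough pre-flight check, one can list which column indices remain once the categorical ones are removed; if none remain, SMOTE-NC has no continuous feature to use. The helper name `continuous_indices` below is hypothetical and is not part of imbalanced-learn.

```python
def continuous_indices(n_features, categorical_indices):
    """Return the column indices that are not declared categorical."""
    categorical = set(categorical_indices)
    return [i for i in range(n_features) if i not in categorical]


# The question's setup: X_train has two columns, both declared categorical.
remaining = continuous_indices(n_features=2, categorical_indices=[0, 1])

# An empty list means every feature is nominal, which is the situation
# the ValueError in the question reports.
all_nominal = len(remaining) == 0
```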