
Label encode then impute missing then inverse encoding

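I have a dataset about police killings, which you can find on Kaggle. Several columns have missing data (fraction of missing values per column):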

UID                0.000000
Name               0.000000
Age                0.018653
Gender             0.000640
Race               0.317429
Date               0.000000
City               0.000320
State              0.000000
Manner_of_death    0.000000
Armed              0.454487
Mental_illness     0.000000
Flee               0.000000
dtype: float64

I created a copy of the original df to encode it and then impute the missing values. My plan was:

  1. Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
       'Mental_illness', 'Flee'],
le = LabelEncoder()
lpf = {}
for col in lepf.columns:    
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)


  2. Then I located the nan values in the original dataframe (pf), so that I could replace the corresponding encoded nan values in lpfdf:
for col in lpfdf:

Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0

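For example, Gender has three encoded labels: 0 for Male, 1 for Female, and 2 for nan. The City feature, however, has more than 3000 values, so I could not find its nan label with value_counts(). Instead, I used: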



(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))

Looking at any one of the rows at these indices, I can see that the nan label for City is 3327:


Gender                1
Race                  6
City               3327
State                10
Manner_of_death       1
Armed                20
Mental_illness        0
Flee                  0
Name: 10549, dtype: int64

Then I went on to replace these nan labels with np.nan:

Gender: 2,
Race: 6,
City: 3327,
Armed: 59

lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
  3. Create an instance of IterativeImputer, then fit it to lpfdf and transform it:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)


itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)

Finally, when I tried the inverse transform to see which labels were imputed, I got the following error:

for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))

ValueError                                Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
      1 for col in lpfdf:
----> 2     le.inverse_transform(itimplpf[col].astype(int))

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
    158         diff = np.setdiff1d(y, np.arange(len(self.classes_)))
    159         if len(diff):
--> 160             raise ValueError(
    161                     "y contains previously unseen labels: %s" % str(diff))
    162         y = np.asarray(y)

ValueError: y contains previously unseen labels: [2 3 4 5]

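Is there anything wrong with my steps? Sorry for the lengthy explanation, but I felt I needed to describe every step so that you can understand the problem properly. Thanks, everyone.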


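Some imputation strategies, such as IterativeImputer, do not guarantee that the output contains only previously known values. The imputed values may therefore be unknown to the encoder, which causes an error on the inverse transform (this is exactly your case).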
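
To make the failure mode concrete, here is a minimal sketch with a hypothetical two-category column (not the Kaggle data). A fitted LabelEncoder can only invert the integer codes it produced itself, so any other code, such as a rounded imputer output, raises the same ValueError as in the question. The array `imputed_codes` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-category column, standing in for a feature such as Gender.
le = LabelEncoder()
le.fit(["Female", "Male"])  # the encoder now knows codes 0 and 1 only

# Made-up stand-in for rounded imputer output; code 2 was never produced by `le`.
imputed_codes = np.array([0, 1, 2])

# Flag which codes fall inside the range the encoder can invert.
valid = np.isin(imputed_codes, np.arange(len(le.classes_)))
print(valid)

# Inverting only the valid codes works.
decoded = le.inverse_transform(imputed_codes[valid])
print(decoded)

# Inverting everything fails on the unseen code.
try:
    le.inverse_transform(imputed_codes)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

A range check like this is only a diagnostic: it shows which imputed values the encoder cannot decode, not what they should have been.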

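It is better to first impute the missing values for both the numerical and the categorical features, and only then encode the categorical features. One option is to use SimpleImputer and replace missing values with the most frequent category or with a new constant value.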
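
As a rough sketch of the impute-then-encode order, assuming a toy frame whose column names mirror the question but whose values are invented, both SimpleImputer options could look like this. The fill value "Unknown" is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the values are made up, not taken from the Kaggle set.
df = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Race": ["White", np.nan, "Black", "White"],
})

# Option 1: fill each column with its most frequent category.
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = pd.DataFrame(mode_imputer.fit_transform(df), columns=df.columns)

# Option 2: fill with a new constant category instead.
const_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
filled_const = pd.DataFrame(const_imputer.fit_transform(df), columns=df.columns)

# Encode only after imputing, so every code comes from the encoder itself
# and the inverse transform cannot meet an unseen value.
encoder = OrdinalEncoder()
codes = encoder.fit_transform(filled_mode)
decoded = encoder.inverse_transform(codes)

print(filled_mode)
print(filled_const)
```

Because the encoder is fitted after imputation, the round trip through `inverse_transform` reproduces the imputed frame exactly.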

Also, a note on LabelEncoder. Its documentation states explicitly:

This transformer should be used to encode target values, i.e. y, and not the input X.

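If you want to stick with an encoding strategy like LabelEncoder, you can use OrdinalEncoder, which does the same thing but is actually intended for feature encoding. Be aware, however, that this kind of encoding may wrongly imply an ordinal relationship between the categories of each feature, which can have undesirable consequences. You should therefore consider other encoding strategies as well.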
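
The following is a small sketch of both routes on invented, already-imputed data. A single OrdinalEncoder keeps one category list per column, whereas a LabelEncoder that is refit inside a loop only remembers the last column it was fitted on. OneHotEncoder is shown as one example of an encoding that does not imply an order; it is an illustration, not a recommendation from the answer.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical, already-imputed categorical features (values are made up).
X = np.array([
    ["Male", "Not fleeing"],
    ["Female", "Car"],
    ["Male", "Foot"],
], dtype=object)

# OrdinalEncoder handles all feature columns at once.
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(X)

# One sorted array of categories per column, used for the inverse transform.
print(ordinal.categories_)
restored = ordinal.inverse_transform(codes)

# One-hot encoding: one indicator column per category, no implied order.
onehot = OneHotEncoder()
indicators = onehot.fit_transform(X).toarray()
print(indicators.shape)
```

With two Gender categories and three Flee categories, the one-hot matrix has five indicator columns, while the ordinal codes stay in two columns.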


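Here, you train a multiclass classification model to predict the missing values of each column. First replace the missing values with a magic value (e.g. -99), then one-hot encode the features. Next, train a classification model to predict the categorical value of the selected column, using the one-hot encoded values of the other columns as training data. The training data, of course, excludes the rows in which the column to be predicted is missing. Finally, build a "test" set from the rows in which that column is missing, predict the values, and impute them into the column. Repeat this for every column that needs its missing values imputed. The example below works on fake integer-coded data and, for brevity, passes those codes to the classifier directly instead of one-hot encoding them.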



import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier

# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,
                                     n_informative=3, n_repeated=16, n_redundant=0)

# convert to fake categorical data
features_og = (features_og*10).astype(int)

# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):    
        if np.random.random() > 0.85:
            features[i,j] = -99

# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # split the rows on whether the selected column holds the missing flag (-99)
    missing = features[:, j] == -99
    if not missing.any():
        continue  # nothing to impute in this column
    other_cols = [x for x in range(n_features) if x != j]
    train = features[~missing]
    test = features[missing]

    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators= 300,cat_features=[identify categorical features here])
    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:, other_cols], train[:, j])
    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:, other_cols])
    # substitute the missing values in column j with the predicted values
    features_fixed[missing, j] = preds