Label encode, then impute missing values, then inverse encode

Question

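I have a dataset of police killings, which you can find on Kaggle. Several columns have missing data (shown as the fraction of missing values per column):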

UID                0.000000
Name               0.000000
Age                0.018653
Gender             0.000640
Race               0.317429
Date               0.000000
City               0.000320
State              0.000000
Manner_of_death    0.000000
Armed              0.454487
Mental_illness     0.000000
Flee               0.000000
dtype: float64

I created a copy of the original df so I could encode it and then impute the missing values. My plan was:

Label encode all the categorical columns:

Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
       'Mental_illness', 'Flee'],
      dtype='object')

le = LabelEncoder()
lpf = {}
for col in lepf.columns:    
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)

Now I have a data frame with all categories encoded.

Then I located the nan values in the original data frame (pf), so that I could replace the corresponding encoded nans in lpfdf:

for col in lpfdf:
    print(col,"\n",len(np.where(pf[col].to_frame().isna())[0]))

Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0

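For example, Gender has three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the City feature has more than 3000 values, so I could not locate its nan label with value_counts(). Instead, I used: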

np.where(pf["City"].to_frame().isna())

which produced:

(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))

Looking at any one of the rows at these indices, I can see that the nan label for City is 3327:

lpfdf.iloc[10549]

Gender                1
Race                  6
City               3327
State                10
Manner_of_death       1
Armed                20
Mental_illness        0
Flee                  0
Name: 10549, dtype: int64

Then I replaced these labels with np.nan:

"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59

"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)

I created an instance of the iterative imputer, then fit and transformed lpfdf:

itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)

Then I built a data frame from the newly imputed values:

itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)

Finally, when I apply the inverse transform to see which labels were imputed, I get the following error:

for col in lpfdf:    
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
      1 for col in lpfdf:
----> 2     le.inverse_transform(itimplpf[col].astype(int))

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
    158         diff = np.setdiff1d(y, np.arange(len(self.classes_)))
    159         if len(diff):
--> 160             raise ValueError(
    161                     "y contains previously unseen labels: %s" % str(diff))
    162         y = np.asarray(y)

ValueError: y contains previously unseen labels: [2 3 4 5]

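Is there anything wrong with my steps? Sorry for the long explanation, but I felt I needed to describe every step so that you can understand the problem properly. Thank you all.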

Answer 1

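Your approach of encoding the categorical values first and then imputing the missing values is prone to problems and is therefore not recommended.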

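Some imputation strategies, such as IterativeImputer, do not guarantee that the output contains only previously known values. This can produce imputed values that are unknown to the encoder and cause an error on the inverse transform (which is exactly your case).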
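
To illustrate the point, here is a minimal sketch on made-up, already label-encoded data (the array X and its codes are assumptions, not taken from the question's dataset). IterativeImputer fills the gaps with regression outputs, so the imputed entries are floats that need not match any existing integer code:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical label-encoded data: column 0 uses codes 0..5, column 1 uses codes 0..1
X = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 0.0],
    [5.0, 1.0],
    [np.nan, 1.0],   # missing code in column 0
    [2.0, np.nan],   # missing code in column 1
])

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)

# Observed entries are kept; only the two gaps are filled with model predictions
print(X_imputed[6, 0], X_imputed[7, 1])
```

Rounding such values gives integers, but nothing ties them to the set of codes a particular encoder was fitted on.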

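It is better to first impute the missing values for both the numerical and the categorical features, and only then encode the categorical features. One option is to use SimpleImputer and replace missing values with either the most frequent category or a new constant value.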
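
A minimal sketch of that order of operations, assuming a small made-up frame (the column names mirror the question, the values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for two categorical columns of pf
df = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Race": ["White", np.nan, "Black", "White"],
}, dtype=object)

# 1) Impute on the raw categories, before any encoding
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# 2) Encode the now-complete columns
encoder = OrdinalEncoder()
codes = encoder.fit_transform(imputed)

# 3) The inverse transform is safe: every code was produced by this encoder
decoded = encoder.inverse_transform(codes)
print(imputed)
```

To keep missingness as its own category instead, SimpleImputer(strategy="constant", fill_value="Unknown") could be used in step 1; the label "Unknown" is only an example.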

Also, a note on LabelEncoder. Its documentation states explicitly:

This transformer should be used to encode target values, i.e. y, and not the input X.
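
As a side note on the loop in the question: a LabelEncoder stores a single mapping in classes_, so refitting the same le on each column leaves it with only the last column's classes. That would be consistent with the "previously unseen labels" error when other columns are decoded. A minimal sketch with one encoder per column, on a made-up frame (df, encoders, encoded and decoded are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the categorical columns of pf
df = pd.DataFrame({
    "Race": ["W", "B", "H", "A", "W", "B"],
    "Flee": ["yes", "no", "no", "yes", "no", "no"],
}, dtype=object)

# Reusing one encoder: each fit_transform call overwrites classes_
le = LabelEncoder()
for col in df.columns:
    le.fit_transform(df[col])
print(le.classes_)  # only the classes of the last column, "Flee"

# One encoder per column keeps every mapping available for the inverse transform
encoders = {}
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

decoded = pd.DataFrame(
    {col: encoders[col].inverse_transform(encoded[col]) for col in df.columns}
)
```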

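If you want to stick with an encoding strategy like LabelEncoder, you can use OrdinalEncoder, which does the same thing but is actually intended for feature encoding. Be aware, however, that such an encoding may wrongly suggest an ordinal relationship between the categories of each feature, which can lead to undesirable results. You should therefore consider other encoding strategies as well.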
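
For comparison, here is a small sketch of OrdinalEncoder next to OneHotEncoder, one common alternative that does not impose an order. The single-column frame is made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single categorical feature
df = pd.DataFrame({"Race": ["White", "Black", "Hispanic", "Black"]}, dtype=object)

# Ordinal codes follow the sorted category order: Black=0, Hispanic=1, White=2,
# which a model may read as Black < Hispanic < White
ordinal = OrdinalEncoder().fit_transform(df)

# One-hot encoding gives each category its own 0/1 column, with no implied order
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(df).toarray()

print(ordinal.ravel())
print(onehot)
```

The trade-off is width: a feature like City with thousands of categories becomes thousands of columns under one-hot encoding.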

Answer 2

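One possibility that may be worth exploring is to predict the missing categorical (encoded) values with a machine learning algorithm, e.g. sklearn.ensemble.RandomForestClassifier.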

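Here you would train a multi-class classification model to predict the missing values of each column. You would start by replacing the missing values with a magic value (e.g. -99) and then one-hot encoding the columns. Next, train a classification model to predict the categorical value of the selected column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude the rows in which the column to be predicted is missing. Finally, build a "test" set from the rows where that column is missing, predict the values, and impute them into the column. Repeat this for every column that needs its missing values imputed.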

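Assuming you want to apply machine learning techniques to this data later, a deeper question is whether the absence of a value in some examples of the dataset might itself carry useful information for predicting your target, and therefore whether a particular imputation strategy would destroy that information.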
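
One way to keep that information, sketched here on made-up numeric codes, is to have the imputer append indicator columns that record where values were missing. The array X is an assumption for illustration; add_indicator is an existing SimpleImputer parameter:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical encoded features with gaps in both columns
X = np.array([
    [25.0, 1.0],
    [np.nan, 0.0],
    [25.0, np.nan],
    [31.0, 1.0],
])

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns 0-1: imputed features; columns 2-3: missingness flags for features 0 and 1
print(X_out)
```

A downstream model can then use the flag columns even though the gaps themselves have been filled.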

Edit: here is an example of what I mean, using dummy data.

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier

# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,n_informative=3, n_repeated= 16, n_redundant = 0)

# convert to fake categorical data
features_og = (features_og*10).astype(int)

# add missing value flag (-99) at random
features = features_og.copy()
# each entry is flagged independently with probability 0.15
missing_mask = np.random.random(features.shape) > 0.85
features[missing_mask] = -99

# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):   
    # do train test split based on whether the selected column value is -99.
    train = features[features[:, j] != -99]
    test = features[features[:, j] == -99]

    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    
    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators= 300,cat_features=[identify categorical features here])
    
    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:,[x for x in range(n_features) if x != j]], train[:,j])
    
    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:,[x for x in range(n_features) if x != j]])
    
    # substitute the missing values in column j with the predicted values
    features_fixed[features[:, j] == -99, j] = preds

