LabelEncoder cannot inverse_transform (unseen labels) after imputing missing values

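I am at a beginner-to-intermediate level in data science. I want to impute missing values in a dataframe using KNN.

Since the dataframe contains both strings and floats, I need to encode/decode the values using LabelEncoder.

My method is as follows: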

  1. Replace NaN to be able to encode
  2. Encode the text values and put them in a dictionary
  3. Retrieve the NaN (previously converted) to be imputed with knn
  4. Assign values with knn
  5. Decode values from the dictionary

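Unfortunately, at the last step, the imputation adds new values that cannot be decoded (unseen labels error message).

Could you explain what I am doing wrong and, ideally, help me correct it? Before finishing, I should say that I know other tools such as OneHotEncoder exist, but I don't know them well enough, and I find LabelEncoder more intuitive because you can see it directly in the dataframe (where LabelEncoder provides an array).

Please find an example of my method below. Thank you very much for your help.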

[1]

# Import libraries. 
import pandas as pd 
import numpy as np

# Initialise the data as a dict of lists. 
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]} 

# Make a DataFrame 
df = pd.DataFrame(data) 

# Print the output. 
df 

Output : 
    Name    Age     Car color   Height
0   Jack    59.0    Blue    177.0
1   NaN     NaN     Black   150.0
2   Victoria    29.0    NaN     NaN
3   Nicolas     NaN     Black   180.0
4   Victor  65.0    Grey    175.0
5   Brad    50.0    NaN     190.0

[2]

# LabelEncoder does not work with NaN values, so I replace them with value '1000' : 
df = df.replace(np.nan, 1000)

# And to avoid errors, str columns must be set as strings (even '1000' value) : 
df[['Name','Car color']] = df[['Name','Car color']].astype(str)

df

Output 
    Name    Age     Car color   Height
0   Jack    59.0    Blue    177.0
1   1000    1000.0  Black   150.0
2   Victoria    29.0    1000    1000.0
3   Nicolas     1000.0  Black   180.0
4   Victor  65.0    Grey    175.0
5   Brad    50.0    1000    190.0

[3]

# Import LabelEncoder library : 
from sklearn.preprocessing import LabelEncoder

# define labelencoder :
le = LabelEncoder()

# Import defaultdict library to make a dict of labelencoder :
from collections import defaultdict

# Initiate a dict of LabelEncoder values :
encoder_dict = defaultdict(LabelEncoder)

# Make a new dataframe of LabelEncoder values :
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))

# Show output :
df

Output 
    Name    Age     Car color   Height
0   2   59.0    2   177.0
1   0   1000.0  1   150.0
2   5   29.0    0   1000.0
3   3   1000.0  1   180.0
4   4   65.0    3   175.0
5   1   50.0    0   190.0

[4]

# Convert 1000 back to missing values so that they can be imputed: 
df = df.replace(1000, np.nan)
df

Output 

    Name    Age     Car color   Height
0   2   59.0    2   177.0
1   0   NaN     1   150.0
2   5   29.0    0   NaN
3   3   NaN     1   180.0
4   4   65.0    3   175.0
5   1   50.0    0   190.0

[5]

# Import the KNN imputer to impute missing values: 
from sklearn.impute import KNNImputer

# Define imputer : 
imputer = KNNImputer(n_neighbors=2)

# Impute and reassign index/columns: 
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
df

Output 

    Name    Age     Car color   Height
0   2.0     59.0    2.0     177.0
1   0.0     47.0    1.0     150.0
2   5.0     29.0    0.0     165.0
3   3.0     44.0    1.0     180.0
4   4.0     65.0    3.0     175.0
5   1.0     50.0    0.0     190.0

[6]

# Decode data : 
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)

# Apply it to df -> THIS IS WHERE ERROR OCCURS :
df[['Name','Car color']].apply(inverse_transform_lambda)

Error message:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)

5 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6926             kwds=kwds,
   6927         )
-> 6928         return op.get_result()
   6929 
   6930     def applymap(self, func):

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
    290 
    291         # compute the result using the series generator
--> 292         self.apply_series_generator()
    293 
    294         # wrap results

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
    319             try:
    320                 for i, v in enumerate(series_gen):
--> 321                     results[i] = self.f(v)
    322                     keys.append(v.name)
    323             except Exception as e:

<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)

/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
    297                     "y contains previously unseen labels: %s" % str(diff))
    298         y = np.asarray(y)
--> 299         return self.classes_[y]
    300 
    301     def _more_tags(self):

IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
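
The last line of the traceback suggests the failure is about dtype rather than about genuinely unseen labels: KNNImputer returns a float array, so the encoded columns come back as float64 and can no longer be used as indices. A minimal sketch that checks this on toy data (the `toy` frame and its column names are made up for illustration and are not the data from the post):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Integer codes produced by LabelEncoder (toy labels, not the post's data).
codes = LabelEncoder().fit_transform(["Blue", "Black", "Grey", "Black"])

# Hypothetical frame: one encoded column and one numeric column with a gap.
toy = pd.DataFrame({"color": codes, "age": [59.0, np.nan, 65.0, 50.0]})

# KNNImputer works on (and returns) a float array.
imputed = KNNImputer(n_neighbors=2).fit_transform(toy)

print(codes.dtype)    # an integer dtype
print(imputed.dtype)  # float64, even for the column that held integer codes
```

So even when every imputed code is a whole number after rounding, it is still stored as a float.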

As per my comment, you should do this:

# Decode data : 
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int)) # or x[].astype(int)
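
For completeness, here is a rough end-to-end sketch of the same idea on the question's data. It rests on a few assumptions that are not in the original post: the names `cat_cols`, `encoders`, `encoded`, `imputed` and `result` are illustrative, the column is called 'Height' without the trailing space, and, instead of the 1000 placeholder, only the non-missing strings are encoded so that missing names and colors stay NaN and are imputed as well. The cast to int before `inverse_transform` is the fix suggested above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Name": ["Jack", np.nan, "Victoria", "Nicolas", "Victor", "Brad"],
    "Age": [59, np.nan, 29, np.nan, 65, 50],
    "Car color": ["Blue", "Black", np.nan, "Black", "Grey", np.nan],
    "Height": [177, 150, np.nan, 180, 175, 190],
})
cat_cols = ["Name", "Car color"]  # assumed list of text columns

# Encode only the non-missing strings; missing entries stay NaN.
encoders = {}
encoded = df.copy()
for col in cat_cols:
    le = LabelEncoder()
    mask = df[col].notna()
    codes = pd.Series(np.nan, index=df.index, dtype=float)
    codes.loc[mask] = le.fit_transform(df.loc[mask, col].astype(str))
    encoded[col] = codes
    encoders[col] = le

# Impute, then round so the categorical codes are whole numbers again.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    np.round(imputer.fit_transform(encoded)),
    columns=encoded.columns,
    index=encoded.index,
)

# Decode: cast the float codes back to int before inverse_transform.
result = imputed.copy()
for col in cat_cols:
    int_codes = imputed[col].astype(int).to_numpy()
    result[col] = encoders[col].inverse_transform(int_codes)

print(result)
```

One caveat with this sketch: the integer codes have no natural order, so averaging the codes of the nearest neighbours and rounding gives a valid label, but not necessarily a meaningful one, especially for a column like Name.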