OneHotEncoding : TypeError: cannot perform reduce with flexible type

I am trying to fit a OneHotEncoder on X_train and then transform both X_train and X_test. However, this results in an error:

# One hot encoding 
from sklearn.preprocessing import OneHotEncoder
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])
X_test = enc.transform(X_test[encode_columns])
X_train.head()

Error:

      4 
      5 enc = OneHotEncoder(handle_unknown='ignore')
----> 6 enc.fit(X_train[encode_columns])
      7 X_train = enc.transform(X_train[encode_columns])
      8 X_test = enc.transform(X_test[encode_columns])

TypeError: cannot perform reduce with flexible type

Sample rows of X_train:

TLDR: You probably ran the cell that fits and transforms more than once, and .transform() does not return what you think it returns.

Why does this error occur?

Suppose you have the data definition in one cell:

X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                        'building_class_category': ["01", "02", "02", "01", "13"], 
                        'commercial_units': ["O", "O", "O", "O", "A"],
                        'residential_units': [1,2,2,1,1]})

and you fit the one-hot encoder in a second cell:

encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])

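The cell above works the first time, but its last line overwrites X_train with the encoder's output. If you run the cell a second time, X_train is no longer a DataFrame, so selecting the columns from it fails: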

TypeError: cannot perform reduce with flexible type

So the first part of the answer is: use different names for the input and the output.
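As a minimal sketch of that advice, the cell below keeps the raw frames and the encoded results under separate names, so it can be run repeatedly. The loop only simulates re-running a notebook cell, the X_test rows are made up for illustration, and the names X_train_encoded and X_test_encoded are assumptions rather than anything from the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                        'building_class_category': ["01", "02", "02", "01", "13"],
                        'commercial_units': ["O", "O", "O", "O", "A"],
                        'residential_units': [1, 2, 2, 1, 1]})
# Hypothetical test rows; "Bronx" does not occur in X_train.
X_test = pd.DataFrame({'borough': ["Queens", "Bronx"],
                       'building_class_category': ["02", "13"],
                       'commercial_units': ["O", "A"],
                       'residential_units': [2, 1]})
encode_columns = ['borough', 'building_class_category', 'commercial_units', 'residential_units']

# Simulate running the same notebook cell twice.
for _ in range(2):
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(X_train[encode_columns])
    # The results get new names, so X_train and X_test stay DataFrames.
    X_train_encoded = enc.transform(X_train[encode_columns])
    X_test_encoded = enc.transform(X_test[encode_columns])

print(type(X_train).__name__, X_train_encoded.shape, X_test_encoded.shape)
```

Because handle_unknown='ignore' is set, the unseen borough in the test rows is encoded as all zeros for the borough columns instead of raising an error.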

What does OneHotEncoder's transform return?

If you print enc.transform(X_train[encode_columns]), you get:

<5x9 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>

By default, OneHotEncoder's transform does not return a pandas DataFrame (or even a numpy array) but a sparse matrix. To get a numpy array, you have to convert it:

enc.transform(X_train[encode_columns]).toarray()

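Alternatively, set sparse=False when constructing the OneHotEncoder (the parameter was renamed to sparse_output in scikit-learn 1.2, and the old name was removed in 1.4):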

enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
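Since the name of the dense-output keyword depends on the installed scikit-learn release, one way to stay independent of it is to leave the encoder at its default and densify the result yourself. The following is a sketch under that assumption; the variable names encoded and dense are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                        'building_class_category': ["01", "02", "02", "01", "13"],
                        'commercial_units': ["O", "O", "O", "O", "A"],
                        'residential_units': [1, 2, 2, 1, 1]})

enc = OneHotEncoder(handle_unknown='ignore')  # default output is a SciPy sparse matrix
encoded = enc.fit_transform(X_train)

print(sparse.issparse(encoded))   # check what transform actually returned
dense = encoded.toarray()         # convert the sparse matrix to a numpy array
print(type(dense).__name__, dense.shape)
```

For wide encodings with many categories, keeping the sparse matrix may be preferable, since the dense array stores every zero explicitly.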

Bonus: how do you get descriptive feature names?

After setting sparse=False, enc.transform(X_train[encode_columns]) returns a numpy array. Even if you convert it to a pd.DataFrame, the column names will not tell you much:

pd.DataFrame(enc.transform(X_train[encode_columns]))

#   0   1   2   3   4   5   6   7   8
#0  0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
#1  1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
#2  0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
#3  0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
#4  1.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0

To get proper column names, use the get_feature_names_out() method:

pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())

#   borough_Brooklyn    borough_Queens  ... residential_units_2
#0  0.0                 1.0             ... 0.0
#1  1.0                 0.0             ... 1.0
#2  0.0                 1.0             ... 1.0
#3  0.0                 1.0             ... 0.0
#4  1.0                 0.0             ... 0.0

Full code:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                        'building_class_category': ["01", "02", "02", "01", "13"], 
                        'commercial_units': ["O", "O", "O", "O", "A"],
                        'residential_units': [1,2,2,1,1]})
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']

enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit(X_train[encode_columns])
X_train_encoded = pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())