How to retain the columns from training data for prediction in Python

Question

I have a dataset that looks like this:

| Amount   | Source | y |
| -------- | ------ | - |
| 285      | a      | 1 |
| 556      | b      | 0 | 
| 883      | c      | 0 |
| 156      | c      | 1 |
| 374      | a      | 1 |
| 1520     | d      | 0 |

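'Source' is a categorical variable. The categories in this field are 'a', 'b', 'c', and 'd', so the one-hot encoded columns are 'source_a', 'source_b', 'source_c', and 'source_d'. I am using this model to predict the value of y. The new data for prediction does not contain all the categories used in training; it only has categories 'a', 'c', and 'd'. When I one-hot encode this dataset, the column 'source_b' is missing. How do I transform this data so that it has the same columns as the training data?

PS: I am using XGBClassifier() for prediction.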

Answer 1

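Use the same encoder instance. Assuming you chose sklearn's OneHotEncoder, all you have to do is export the fitted encoder as a pickle so that you can load it later for inference.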

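from sklearn.preprocessing import OneHotEncoder
import pickle

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# Assume X_train holds only the Source column, as a 2-D array or DataFrame
X_train = enc.fit_transform(X_train)
# Save the fitted encoder so inference reuses the same categories
with open('onehot.pickle', 'wb') as f:
    pickle.dump(enc, f)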

Then load it for inference:

import pickle
# Load the encoder that was fitted on the training data
with open("onehot.pickle", "rb") as f:
    loaded_enc = pickle.load(f)

Then all you have to do is call:

#X_test is the source column of your test data
X_test = loaded_enc.transform(X_test)

In general, once you have fitted the encoder on X_train, all you have to do is transform the test set with it. So

X_test = loaded_enc.transform(X_test)
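
To make the idea concrete, here is a minimal end-to-end sketch using the data from the question. The new rows, the helper name `encode`, and the in-memory pickle round trip (standing in for the file on disk) are assumptions made for illustration; this is one way to wire it up, not the only one.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data from the question: Source takes the values a, b, c, d
train = pd.DataFrame({
    "Amount": [285, 556, 883, 156, 374, 1520],
    "Source": ["a", "b", "c", "c", "a", "d"],
})
# Hypothetical new data for prediction: category "b" never appears
new = pd.DataFrame({
    "Amount": [410, 920, 133],
    "Source": ["a", "c", "d"],
})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Source"]])  # double brackets keep the input 2-D

# Round trip through pickle, standing in for saving to and loading from disk
loaded_enc = pickle.loads(pickle.dumps(enc))


def encode(df, encoder):
    """Return Amount plus one source_<category> column per training category."""
    onehot = encoder.transform(df[["Source"]]).toarray().astype(int)
    cols = [f"source_{c}" for c in encoder.categories_[0]]
    encoded = pd.DataFrame(onehot, columns=cols, index=df.index)
    return pd.concat([df[["Amount"]], encoded], axis=1)


X_train = encode(train, loaded_enc)
X_new = encode(new, loaded_enc)
print(X_new)
```

Because the column names come from the categories the encoder learned during fitting, X_new has a source_b column (all zeros) even though 'b' never occurs in the new data, so it lines up with X_train.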

Answer 2

Write it out explicitly:

import pandas as pd
import numpy as np

# An example of your dataframe with no "b" source
df = pd.DataFrame({
    "Amount": [int(i) for i in np.random.normal(800, 300, 10)],
    "Source": np.random.choice(["a", "c", "d"], 10),
    "y": np.random.choice([1, 0], 10),
})

# One-hot encoding: one explicit column per training category
df["source_a"] = np.where(df.Source == "a", 1, 0)
df["source_b"] = np.where(df.Source == "b", 1, 0)
df["source_c"] = np.where(df.Source == "c", 1, 0)
df["source_d"] = np.where(df.Source == "d", 1, 0)

Output of the dataframe:

   Amount Source  y  source_a  source_b  source_c  source_d
0     685      d  0         0         0         0         1
1    1149      c  1         0         0         1         0
2    1220      a  0         1         0         0         0
3     834      c  0         0         0         1         0
4     780      c  0         0         0         1         0
5     502      a  0         1         0         0         0
6     191      c  1         0         0         1         0
7     637      c  0         0         0         1         0
8     701      d  0         0         0         0         1
9     941      c  1         0         0         1         0

As a general rule, dependencies should be kept to a minimum...
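
The same explicit approach can be written as a loop over a fixed list of categories, which keeps it to pandas only. This is a small sketch: the names `SOURCE_CATEGORIES` and `one_hot_source` and the sample rows are made up for illustration, and the category list is assumed to be the one from the question.

```python
import pandas as pd

# Categories seen during training, written down once (taken from the question)
SOURCE_CATEGORIES = ["a", "b", "c", "d"]


def one_hot_source(df, categories=SOURCE_CATEGORIES):
    """Add one source_<category> indicator column per known category."""
    out = df.copy()
    for cat in categories:
        out[f"source_{cat}"] = (out["Source"] == cat).astype(int)
    return out


# Hypothetical prediction data with no "b" rows
new = pd.DataFrame({"Amount": [410, 920, 133], "Source": ["a", "c", "d"]})
encoded = one_hot_source(new)
print(encoded.columns.tolist())
```

Since the columns are driven by the fixed list rather than by whatever values happen to be in the data, training and prediction frames always end up with the same source_* columns in the same order.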
