How do I make sure that all data is encoded to the same numbers when transforming data for training and predicting with sklearn models?

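I want to make sure that the data in an incoming dataset is encoded the same way as the data the model was trained on. For example...

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})

After the transformation it should look like this: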

>>>df
prediction  features
1           1
2           2
3           3

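Now I want to make sure that a new set of data...

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})

is transformed with the same mapping as the original DataFrame. Note that I deliberately added a new value to new_df, because the model has to handle that case too. The new DataFrame should look like this...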

>>>new_df
prediction  features
4           3
1           2
2           1

How can I achieve this, and how do I inverse-transform the data afterwards?

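You can use LabelEncoder here: fit it on the training data, then append any labels it has not seen before to its classes_ before transforming the new data.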

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
le = preprocessing.LabelEncoder()
le.fit(df["prediction"])
oldData = df['prediction'].tolist()
df["prediction"] = le.transform(df["prediction"])

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
# Labels the encoder has not seen yet (sorted so the order is reproducible)
newData = sorted(set(new_df['prediction']) - set(oldData))
# Append them so existing codes stay unchanged and new labels get the next codes
le.classes_ = np.append(le.classes_, newData)
new_df["prediction"] = le.transform(new_df["prediction"])
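The approach above can be wrapped in a small helper. This is only a minimal sketch: `transform_with_new_labels` is a hypothetical name, not part of scikit-learn, and it uses plain lists instead of a DataFrame to stay short. Keep in mind that `classes_` is normally sorted and appending to it can break that ordering; whether `transform` tolerates an unsorted `classes_` may depend on the scikit-learn version, so treat this as a workaround rather than a supported feature.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def transform_with_new_labels(encoder, values):
    """Encode values, first appending any unseen labels to encoder.classes_.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    known = set(encoder.classes_)
    # dict.fromkeys removes duplicates while keeping first-appearance order,
    # so new labels always receive their codes in a deterministic order.
    unseen = [v for v in dict.fromkeys(values) if v not in known]
    if unseen:
        encoder.classes_ = np.append(encoder.classes_, unseen)
    return encoder.transform(values)


le = LabelEncoder()
train_codes = le.fit_transform(["red", "green", "blue"])
new_codes = transform_with_new_labels(le, ["yellow", "red", "green"])

print(train_codes.tolist())   # [2, 1, 0]  (classes are sorted: blue, green, red)
print(new_codes.tolist())     # [3, 2, 1]  (yellow gets the next free code)
print(le.classes_.tolist())   # ['blue', 'green', 'red', 'yellow']
```

Note that LabelEncoder assigns codes starting at 0 in alphabetical order, so the numbers differ from the 1-based ones shown in the question, but the mapping stays the same between the training data and the new data.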

Update: to encode every column, keep one LabelEncoder per column.

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
encoderDict = {}  # column name -> fitted LabelEncoder
oldData = {}      # column name -> values seen during fitting
for col in df.columns:
    le = preprocessing.LabelEncoder()
    le.fit(df[col])
    encoderDict[col] = le
    oldData[col] = df[col].tolist()
    df[col] = le.transform(df[col])

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
newData = {}
for col in new_df.columns:
    # Labels this column's encoder has not seen yet (sorted for a reproducible order)
    newData[col] = sorted(set(new_df[col]) - set(oldData[col]))
    # Append them so existing codes stay unchanged
    encoderDict[col].classes_ = np.append(encoderDict[col].classes_, newData[col])
    new_df[col] = encoderDict[col].transform(new_df[col])

To inverse-transform the data, you only need to do the following.

# Stack the two encoded frames, then decode each column with its own encoder
ndf = pd.concat([df, new_df]).reset_index(drop=True)
for col in ndf:
    print(encoderDict[col].inverse_transform(ndf[col]))
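As a quick sanity check of the round trip, the sketch below encodes two columns and decodes them again with the same per-column encoders. It is an illustration under simplifying assumptions: the names `columns`, `encoders`, `encoded`, and `restored` are made up for this example, and plain lists stand in for the DataFrame columns.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for the DataFrame columns used above
columns = {
    "prediction": ["red", "green", "blue"],
    "features": ["one", "two", "three"],
}

encoders = {}  # column name -> fitted LabelEncoder
encoded = {}   # column name -> integer codes
for name, values in columns.items():
    encoders[name] = LabelEncoder()
    encoded[name] = encoders[name].fit_transform(values)

# Decode every column with the encoder that produced its codes
restored = {
    name: encoders[name].inverse_transform(codes).tolist()
    for name, codes in encoded.items()
}

print(encoded["features"].tolist())  # [0, 2, 1]  (sorted classes: one, three, two)
print(restored == columns)           # True
```

The important design point is that each column must be decoded with the same encoder object that encoded it, which is why the encoders are kept in a dictionary keyed by column name.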