How do I make sure data is always encoded to the same numbers when transforming it for training and predicting with sklearn models?
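I want to make sure that values in an incoming dataset are encoded the same way as the data the model was trained on. For example...
df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
After the transformation it should look like this: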
>>>df
prediction features
1 1
2 2
3 3
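Now I want to make sure that a new set of data...
new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
will be transformed with the same mapping as the original df. Note that I deliberately added a value to new_df that does not appear in df, because the model has to handle that as well. The new DataFrame should look like this...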
>>>new_df
prediction features
4 3
1 2
2 1
How can I achieve this, and how do I reverse the transformation?
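You can use LabelEncoder here. Note that it assigns codes starting from 0 in sorted order of the labels, so the exact numbers differ from those in the question, but the mapping stays consistent between the two DataFrames.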
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
# Fit the encoder on the training labels and remember which values it has seen.
le = preprocessing.LabelEncoder()
le.fit(df["prediction"])
oldData = df['prediction'].tolist()
df["prediction"] = le.transform(df["prediction"])

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
# Append unseen labels to the end of classes_ so the existing codes stay unchanged.
newData = new_df['prediction'].tolist()
newData = sorted(set(newData) - set(oldData))
le.classes_ = np.append(le.classes_, newData)
new_df["prediction"] = le.transform(new_df["prediction"])
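Conceptually, the snippet above keeps the value-to-code table fixed after fitting and only appends new entries at the end. A minimal, dependency-free sketch of that idea follows. The class name `GrowingEncoder` and its methods are hypothetical, chosen for illustration, and are not part of sklearn.

```python
class GrowingEncoder:
    """Hypothetical helper: maps values to integer codes, appending unseen values."""

    def __init__(self):
        self.classes = []  # position in this list is the code
        self.codes = {}    # value -> code

    def fit(self, values):
        # Like LabelEncoder, assign the initial codes in sorted order, starting at 0.
        self.classes = sorted(set(values))
        self.codes = {v: i for i, v in enumerate(self.classes)}
        return self

    def transform(self, values):
        out = []
        for v in values:
            if v not in self.codes:
                # Unseen value: give it the next free code; old codes never change.
                self.codes[v] = len(self.classes)
                self.classes.append(v)
            out.append(self.codes[v])
        return out

    def inverse_transform(self, codes):
        return [self.classes[c] for c in codes]


enc = GrowingEncoder().fit(["red", "green", "blue"])
train_codes = enc.transform(["red", "green", "blue"])
new_codes = enc.transform(["yellow", "red", "green"])
print(train_codes, new_codes)               # [2, 1, 0] [3, 2, 1]
print(enc.inverse_transform(new_codes))     # ['yellow', 'red', 'green']
```

Because 'red' and 'green' keep the codes they received during fitting, a model trained on the first DataFrame sees the same numbers for them in the second.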
Update: to encode every column, keep one encoder per column.
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
encoderDict = {}  # column name -> fitted LabelEncoder
oldData = {}      # column name -> values seen during fitting
for col in df.columns:
    le = preprocessing.LabelEncoder()
    le.fit(df[col])
    encoderDict[col] = le
    oldData[col] = df[col].tolist()
    df[col] = le.transform(df[col])

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
newData = {}
for col in new_df.columns:
    # Append values unseen in this column to its encoder, then encode the column.
    newData[col] = new_df[col].tolist()
    newData[col] = sorted(set(newData[col]) - set(oldData[col]))
    encoderDict[col].classes_ = np.append(encoderDict[col].classes_, newData[col])
    new_df[col] = encoderDict[col].transform(new_df[col])
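To reverse the transformation, you can simply do the following.
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
ndf = pd.concat([df, new_df]).reset_index(drop=True)
for col in ndf:
    print(encoderDict[col].inverse_transform(ndf[col]))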
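If you would rather not modify `classes_` by hand, one alternative is sklearn's `OrdinalEncoder`, which in scikit-learn 0.24 and later can map unseen categories to a sentinel value. This is a different technique from the one above: all unseen values share a single code instead of each getting a new one. A sketch, assuming -1 as the sentinel and plain lists instead of DataFrames:

```python
from sklearn.preprocessing import OrdinalEncoder

# Columns are (prediction, features), matching the DataFrames above.
train = [["red", "one"], ["green", "two"], ["blue", "three"]]
new = [["yellow", "three"], ["red", "two"], ["green", "one"]]

# Unseen categories are encoded as -1 (an assumed sentinel) instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

new_codes = enc.transform(new)
print(new_codes)

# Known codes map back to their original values.
restored = enc.inverse_transform(new_codes)
print(restored)
```

Here 'yellow' is encoded as -1, and `inverse_transform` cannot recover it (the documentation states that unknown categories come back as None), so this only fits cases where unseen values do not need to be told apart.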