Can we make the ML model (pickle file) more robust by accepting (or ignoring) new features?

However, I am struggling. I have a column containing string values, for example:

Sex       
Male       
Female
# This is just an example; the real column has many more unique values

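Here is the problem: I received a new (unique) value, and now I can no longer make predictions (e.g. 'Neutral' was added).

Because I convert the 'Sex' column into dummies, my model no longer accepts the input: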

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3
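The mismatch can be reproduced in a few lines. This is only a minimal sketch: it assumes the dummies are built with `pd.get_dummies` (the exact encoding step is not shown here), and the two small DataFrames are hypothetical stand-ins for the training data and the new data.

```python
import pandas as pd

# Hypothetical training data: only two categories are present.
train = pd.DataFrame({"Sex": ["Male", "Female", "Female"]})
# Hypothetical new data: a previously unseen category appears.
new = pd.DataFrame({"Sex": ["Male", "Neutral", "Female"]})

# get_dummies creates one column per category found in the data it is given,
# so the two frames end up with a different number of columns.
train_dummies = pd.get_dummies(train, columns=["Sex"])
new_dummies = pd.get_dummies(new, columns=["Sex"])

print(list(train_dummies.columns))  # ['Sex_Female', 'Sex_Male']
print(list(new_dummies.columns))    # ['Sex_Female', 'Sex_Male', 'Sex_Neutral']
```

A model fitted on the two training columns then receives three columns at prediction time, which is what the error message reports.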

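Hence my question: is there a way to make my model more robust so that it ignores this class and still makes a prediction, just without that specific information?

What I have tried: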

import pickle

import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)

# 'example_df' contains just one row of training data (exactly the columns the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Check for columns missing from the new dataset and add them
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # add the missing columns with 0 values (which is OK, since everything is a dummy)

# Make sure the columns are in the same order
df = df[example_df.columns]

# The prediction leads to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and input n_features is Y

Note that I searched but could not find any helpful solution (not here, here or here).

Update

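I also found this article, but it has the same problem: we can make the test set have the same columns as the training set, but what about new real-world data (e.g. the new value 'Neutral')?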

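Yes, once training is finished you cannot add a new category or feature to the dataset without retraining (updating) the model. OneHotEncoder can, however, handle new categories that show up in a feature at prediction time: with handle_unknown='ignore', an unseen category is encoded as all zeros in that feature's dummy columns, so the columns stay consistent between your training and test data for the categorical variables.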
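To see what `handle_unknown='ignore'` does at the encoding level, here is a minimal sketch using the encoder on its own. The tiny arrays are made up for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the two categories seen during training.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["male"], ["female"]], dtype=object))

# Transform one known value and one value the encoder has never seen.
new_values = np.array([["male"], ["neutral"]], dtype=object)
encoded = encoder.transform(new_values).toarray()

print(encoder.categories_[0])  # ['female' 'male']
print(encoded)
# [[0. 1.]
#  [0. 0.]]
```

The unseen value 'neutral' becomes a row of zeros, so the number of columns stays the same as at fit time.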

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# Let us introduce new categories in feature_2 in the test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# Example output (the exact labels vary, since the data is random):
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes'], dtype=object)
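Since the question is about a pickled model, one option is to pickle the whole pipeline (encoder plus estimator) rather than the estimator alone, so the encoding learned at training time travels with the model. The sketch below is an illustration under assumptions: the column names, the toy data, and the in-memory `pickle.dumps`/`pickle.loads` round trip (standing in for a file on disk) are all hypothetical.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two known categories.
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"feature_1": rng.random(20),
                         "Sex": ["Male", "Female"] * 10})
target = pd.Series(["yes", "no"] * 10)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["Sex"])],
        remainder="passthrough")),
    ("lr", LogisticRegression()),
])
pipeline.fit(train_df, target)

# Serialize and restore the fitted pipeline (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(pipeline))

# Raw new data, including a category that was not present during training.
new_df = pd.DataFrame({"feature_1": [0.2, 0.7, 0.5],
                       "Sex": ["Male", "Neutral", "Female"]})
predictions = restored.predict(new_df)
print(predictions.shape)  # (3,)
```

With this setup the raw string column is passed straight to `predict`, and no manual dummy-column alignment is needed before calling the restored model.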