DataType of InputField is double although in the PMMLPipeline it is string

Question

I am exporting a PMMLPipeline with a categorical string feature, day_of_week, to a PMML file. When I open the file in Java and list the InputFields, I see that the data type of the day_of_week field is double:

InputField{name=day_of_week, fieldName=day_of_week, displayName=null, dataType=double, opType=categorical}

As a result, when I evaluate an input, I get this error:

org.jpmml.evaluator.InvalidResultException: Field "day_of_week" cannot accept user input value "tuesday"

On the Python side, the pipeline works with a string column:

data = pd.DataFrame(data=[{"age": 10, "day_of_week": "tuesday"}])
y = trained_model.predict(X=data)

Minimal example that creates the PMML file:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

if __name__ == '__main__':

    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }

    data = pd.DataFrame(data_dict, columns=data_dict)

    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])

    categorical_features = ['day_of_week']
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])

    pipeline = PMMLPipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])

    X = data.drop(labels=['y'], axis=1)
    y = data['y']

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)

    trained_model = pipeline.fit(X=X_train, y=y_train)
    sklearn2pmml(pipeline=pipeline, pmml='RandomForestRegressor2.pmml', with_repr=True)

Edit: sklearn2pmml creates a PMML file whose DataDictionary contains the DataField "day_of_week" with dataType="double". I think it should be "String". Do I have to set the data type somewhere to correct this?

<DataDictionary>
    <DataField name="day_of_week" optype="categorical" dataType="double">

Answer 1

You can assist SkLearn2PMML by providing "feature type hints" using the sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain decorators (see here for details).

In the current case, you should add a CategoricalDomain step to the pipeline that handles the categorical features:

from sklearn2pmml.decoration import CategoricalDomain

# The domain step comes first so that it sees the raw string column
# before one-hot encoding.
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype=str)),
    ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))
])
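
After regenerating the PMML file, one way to check the result without going through Java is to read the DataDictionary directly. The following is only a minimal sketch using Python's standard library. The helper name data_field_types and the sample fragment are hypothetical and written by hand for illustration, not output produced by sklearn2pmml. The namespace is stripped from each tag so that the lookup does not depend on the PMML schema version.

```python
import xml.etree.ElementTree as ET


def data_field_types(pmml_text):
    """Map each DataField name to its (optype, dataType) pair."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for element in root.iter():
        # Drop the "{namespace}" prefix so any PMML schema version is accepted.
        tag = element.tag.rsplit('}', 1)[-1]
        if tag == 'DataField':
            fields[element.get('name')] = (element.get('optype'), element.get('dataType'))
    return fields


# Hypothetical, hand-written fragment standing in for the regenerated file.
sample = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="day_of_week" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""

print(data_field_types(sample))
```

For a real file, the same helper can be given the file contents read in binary mode, for example the bytes of RandomForestRegressor2.pmml. If the domain step took effect, the day_of_week entry should no longer report double.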

scikit-learn

pmml