XGBoost Google-AI-Model expecting float values instead of using categorical values and converting them

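I am trying to run a simple XGBoost prediction on Google Cloud, using this simple example: https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions

The model builds fine, but when I try to run a prediction with the sample input JSON, it fails with the error "Could not initialize DMatrix from inputs: could not convert string to float:", as shown in the screen capture below. I understand this happens because the test input contains strings, but I expected the Google machine learning model to hold the information needed to convert categorical values to floats. I cannot expect my users to submit online prediction requests with float values.

According to the tutorial, it should work without converting the categorical values to floats. Please advise; I have attached a GIF with more details. Thanks.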

import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')


# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create test labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')

# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
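# reuse the encoders fitted on the training data so that each category
# gets the same code in the test set (re-fitting could assign different codes)
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].transform(test_features[col])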

# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')

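You can use pandas to convert the categorical strings into codes for the model input. For the prediction input, you can define a dictionary for each categorical column that maps each category value to its code. For example, for workclass: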

df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))

If the prediction input is 'somestring', you can look up its code as follows:

category_input = workclass_dict['somestring']
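As a minimal, self-contained sketch of the same idea: the toy DataFrame, the `encode_workclass` helper, and the `-1` fallback for unseen values are assumptions made for illustration, not part of the original code. The mapping is built once from the training column and then used to look up prediction inputs.

```python
import pandas as pd

# Hypothetical toy training column; the real values would come from adult.data.
df = pd.DataFrame(
    {"workclass": [" Private", " State-gov", " Private", " Self-emp-inc"]}
)

# Build the category -> code mapping once, from the training data.
# pandas assigns codes in the order of .cat.categories (sorted for strings).
categories = df["workclass"].astype("category").cat.categories
workclass_dict = {label: code for code, label in enumerate(categories)}


def encode_workclass(value, unknown=-1):
    """Return the training-time code for value, or `unknown` if never seen."""
    return workclass_dict.get(value, unknown)


print(workclass_dict)
print(encode_workclass(" State-gov"), encode_workclass("somestring"))
```

Using `dict.get` with a fallback avoids a `KeyError` when a request contains a category that was not present in the training data; how the model should treat such a value is a separate decision.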

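XGBoost models take floats as input. In your training script, you converted the categorical variables to numbers. The same conversion has to be applied when you submit a prediction.

Here is a fix. Put the input shown in the Google documentation in a file named input.json, then run this. The output is input_numerical.json; if you use it in place of input.json, the prediction will succeed.

This code simply preprocesses the categorical columns into numerical form, using the same procedure that was applied to the training and test data.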

import json

import pandas as pd
from sklearn.preprocessing import LabelEncoder

COLUMNS = (
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income-level",
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
)

with open("./input.json", "r") as json_lines:
    rows = [json.loads(line) for line in json_lines]

prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))

encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])

with open("input_numerical.json", "w") as input_numerical:
    for index, row in prediction_features.iterrows():
        input_numerical.write(row.to_json(orient="values") + "\n")
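One caveat: a `LabelEncoder` fitted on the prediction rows alone assigns codes from those rows only, so the codes may not match the ones the model saw during training. A possible variation, sketched here under assumptions (the toy data, the shortened column list, and the `encoders.pkl` file name are hypothetical), is to save the encoders fitted at training time and only call `transform` at prediction time:

```python
import json
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

CATEGORICAL_COLUMNS = ("workclass", "sex")  # shortened for this sketch

# Hypothetical stand-in for the training features.
train_features = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " Private", " Private"],
    "sex": [" Male", " Female", " Male"],
})

# Training side: fit one encoder per column and save them alongside the model.
encoders = {
    col: LabelEncoder().fit(train_features[col]) for col in CATEGORICAL_COLUMNS
}
encoders_path = os.path.join(tempfile.mkdtemp(), "encoders.pkl")
with open(encoders_path, "wb") as f:
    pickle.dump(encoders, f)

# Prediction side: load the saved encoders and call transform() only.
with open(encoders_path, "rb") as f:
    saved_encoders = pickle.load(f)

rows = [[25, " Private", " Male"]]  # one instance, in training column order
prediction_features = pd.DataFrame(rows, columns=list(train_features.columns))
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = saved_encoders[col].transform(
        prediction_features[col]
    )

# One JSON list per line, matching the format of input_numerical.json.
numerical_lines = [json.dumps(row) for row in prediction_features.values.tolist()]
print("\n".join(numerical_lines))
```

With this approach `transform` raises a `ValueError` for a category that was absent from the training data, which surfaces the mismatch instead of silently producing a wrong code.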

I created this Google Issue Tracker ticket because the Google documentation is missing this important step.