
Why does `categorical_feature` of lightgbm not work?

I want to use LightGBM to predict the tradeMoney of houses, but I ran into trouble when specifying `categorical_feature` in `lgb.Dataset`.
My `data.dtypes` are as follows:

type(train)
pandas.core.frame.DataFrame

train.dtypes
area                  float64
rentType               object
houseFloor             object
totalFloor              int64
houseToward            object
houseDecoration        object
region                 object
plate                  object
buildYear               int64
saleSecHouseNum         int64
subwayStationNum        int64
busStationNum           int64
interSchoolNum          int64
schoolNum               int64
privateSchoolNum        int64
hospitalNum             int64
drugStoreNum            int64

I train it with LightGBM as follows:

categorical_feats = ['rentType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']
folds = KFold(n_splits=5, shuffle=True, random_state=2333)

oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(train.iloc[val_idx], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(params, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=500, early_stopping_rounds=200)

    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx], num_iteration=clf.best_iteration)

    predictions_lgb += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV Score: {:<8.5f}".format(r2_score(target, oof_lgb)))

But even though I specified `categorical_feature`, it still gives an error message like this:

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields rentType, houseFloor, houseToward, houseDecoration, region, plate

Here is my environment:

LightGBM version: 2.2.3
Pandas version: 0.24.2
Python version: 3.6.8
|Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]

Can anyone help me?

The problem is that LightGBM can only handle features of the `category` dtype, not `object`. Here you extract a list of all possible categorical features; `category` columns are encoded as integers internally, but nothing happens to the `object` columns, so LightGBM complains when it finds that not all features have been converted to numbers.

So the solution is

# convert each object column to pandas' category dtype
for c in categorical_feats:
    train[c] = train[c].astype('category')

before your CV (cross-validation) loop.
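To make the fix concrete, here is a small pandas-only sketch (the column name and values are hypothetical, not from the question's data) showing how `astype('category')` changes the dtype and gives LightGBM the integer codes it needs:

```python
import pandas as pd

# Hypothetical toy frame with an object-dtype column, like rentType above.
df = pd.DataFrame({"rentType": ["whole", "shared", "whole", "unknown"]})
print(df["rentType"].dtype)   # object -> lgb.Dataset raises ValueError

df["rentType"] = df["rentType"].astype("category")
print(df["rentType"].dtype)   # category -> accepted by lgb.Dataset

# Internally, a category column is stored as integer codes
# (categories are sorted alphabetically by default):
print(df["rentType"].cat.codes.tolist())  # [2, 0, 2, 1]
```

With the columns converted like this, `lgb.Dataset(..., categorical_feature=categorical_feats)` constructs without the dtype error.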

You should convert categorical features to int type before constructing the Dataset. You will find this information at https://lightgbm.readthedocs.io/en/latest/Python-Intro.html. I encountered a case with both categorical and integer features and got the same error; the solution was to convert all categoricals to integers.
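If you prefer this integer route, one common way to do the conversion (a sketch under my own naming, not code from the original post) is `pandas.factorize`, which maps each unique value to an integer code:

```python
import pandas as pd

# Hypothetical toy column standing in for one categorical feature.
df = pd.DataFrame({"region": ["north", "south", "north", "east"]})

# factorize assigns one integer per unique value, in order of appearance;
# missing values get code -1.
codes, uniques = pd.factorize(df["region"])
df["region"] = codes
print(df["region"].tolist())  # [0, 1, 0, 2]
print(list(uniques))          # ['north', 'south', 'east']
```

Remember to still pass the column name in `categorical_feature` so LightGBM treats these integers as unordered category codes rather than ordinary numeric values.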