Why does `categorical_feature` of lightgbm not work?
I want to use LightGBM to predict the tradeMoney of houses, but I ran into trouble when I specified `categorical_feature` in `lgb.Dataset`.
The `data.dtypes` are as follows:
type(train)
pandas.core.frame.DataFrame
train.dtypes
area float64
rentType object
houseFloor object
totalFloor int64
houseToward object
houseDecoration object
region object
plate object
buildYear int64
saleSecHouseNum int64
subwayStationNum int64
busStationNum int64
interSchoolNum int64
schoolNum int64
privateSchoolNum int64
hospitalNum int64
drugStoreNum int64
I train it with LightGBM as follows:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

categorical_feats = ['rentType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']
folds = KFold(n_splits=5, shuffle=True, random_state=2333)
oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(train.iloc[val_idx], label=target.iloc[val_idx], categorical_feature=categorical_feats)
    num_round = 10000
    clf = lgb.train(params, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=500, early_stopping_rounds=200)
    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx], num_iteration=clf.best_iteration)
    predictions_lgb += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV Score: {:<8.5f}".format(r2_score(target, oof_lgb)))
But even though I specified `categorical_feature`, it still gives this error message:
ValueError: DataFrame.dtypes for data must be int, float or bool. Did
not expect the data types in fields rentType, houseFloor, houseToward,
houseDecoration, region, plate
Here is my environment:
LightGBM version: 2.2.3
Pandas version: 0.24.2
Python version: 3.6.8
|Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit
(AMD64)]
Can anyone help me?
The problem is that lightgbm can only handle features of `category` dtype, not `object`. Here it extracts the list of all possible categorical features; `category` columns are encoded to integers in the code, but nothing happens to `object` columns, so lightgbm complains when it finds that not all features have been converted to numbers.

So the solution is

for c in categorical_feats:
    train[c] = train[c].astype('category')

before your CV (cross-validation) loop.
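A minimal, self-contained sketch of that fix (the toy column values below are illustrative, not from the question's dataset):

```python
import pandas as pd

# Hypothetical frame mimicking the question's object-dtype columns
train = pd.DataFrame({
    "rentType": ["whole", "shared", "whole", "unknown"],
    "houseFloor": ["low", "mid", "high", "mid"],
})
categorical_feats = ["rentType", "houseFloor"]

# The fix: convert object columns to pandas 'category' dtype,
# which lgb.Dataset can consume via its integer codes
for c in categorical_feats:
    train[c] = train[c].astype("category")

print(train.dtypes)
print(train["rentType"].cat.codes.tolist())
```

After the conversion, `train[c].cat.codes` holds the integer encoding that LightGBM uses internally, so constructing `lgb.Dataset(train, ..., categorical_feature=categorical_feats)` no longer raises the dtypes `ValueError`.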
You should convert the categorical features to int type before constructing the Dataset.
You will find this information at https://lightgbm.readthedocs.io/en/latest/Python-Intro.html
I have run into a case with both categorical and integer features, and got the same error. The solution was to convert all the categoricals to integers.
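One hedged way to do that explicit integer conversion (the column names and values here are illustrative) is to replace each categorical column with its integer codes:

```python
import pandas as pd

# Illustrative frame with object-dtype categorical columns
train = pd.DataFrame({
    "region": ["RG001", "RG002", "RG001", "RG003"],
    "plate": ["BK01", "BK01", "BK02", "BK03"],
})

# Encode each categorical column as plain integers;
# .cat.codes assigns 0..n-1 by sorted category, and -1 for NaN
for c in ["region", "plate"]:
    train[c] = train[c].astype("category").cat.codes.astype("int32")

print(train.dtypes)
```

With every column now a numeric dtype, `lgb.Dataset` accepts the frame; you can still pass the column names via `categorical_feature` so LightGBM treats the codes as categories rather than ordered numbers.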