How can I train an XGBoost model with a generator?
I'm trying to stack a BERT TensorFlow model and an XGBoost model in Python. To do this, I trained the BERT model, and I have a generator that takes BERT's predictions (predicted classes) and yields a list that is the categorical data concatenated with the BERT prediction. However, this won't train, because it has no shape. My code is:
...
categorical_inputs=df[cat_cols]
y=pd.get_dummies(df[target_col]).values
xgboost_labels=df[target_col].values
concatenated_text_input=df['concatenated_text']
text_model.fit(tf.constant(concatenated_text_input),tf.constant(y), epochs=8)
cat_text_generator=(list(categorical_inputs.iloc[i].values)+list(text_model.predict([concatenated_text_input.iloc[i]])[0]) for i in range(len(categorical_inputs)))
clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07,
                        reg_lambda=0.1, reg_alpha=0.1, gamma=1)
clf.fit(cat_text_generator, xgboost_labels)
The error I get is:
...
-> 1153 if len(X.shape) != 2:
1154 # Simply raise an error here since there might be many
1155 # different ways of reshaping
AttributeError: 'generator' object has no attribute 'shape'
Although I could create a list or array to hold the data, I'd prefer a solution that works when there is too much data to hold in memory at once. Is there a way to train an xgboost model with a generator?
def generator(X_data, y_data, batch_size):
    while True:
        for step in range(X_data.shape[0] // batch_size):
            start = step * batch_size
            end = (step + 1) * batch_size
            current_x = X_data.iloc[start:end]
            current_y = y_data.iloc[start:end]
            # Or, if it's a numpy array, just slice the rows: X_data[start:end]
            yield current_x, current_y

batch_size = 32
gen = generator(X, y, batch_size)
number_of_steps = X.shape[0] // batch_size
clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07,
                        reg_lambda=0.1, reg_alpha=0.1, gamma=1)
for step in range(number_of_steps):
    X_g, y_g = next(gen)
    if step == 0:
        clf.fit(X_g, y_g)
    else:
        # Pass the previous booster via xgb_model so each call continues
        # training; otherwise fit() starts over from scratch on every batch.
        clf.fit(X_g, y_g, xgb_model=clf.get_booster())
You can use DeviceQuantileDMatrix with a custom iterator as input. The iterator must implement xgboost.core.DataIter. Here's an example from the xgboost repository:
https://github.com/dmlc/xgboost/blob/master/demo/guide-python/quantile_data_iterator.py