How can I train an XGBoost with a generator?

I'm trying to stack a BERT TensorFlow model and an XGBoost model in Python. To do this, I trained the BERT model and have a generator that takes the predictions from BERT (the predicted classes) and yields a list that is the result of concatenating the categorical data to the BERT predictions. However, this won't train, because it has no shape. My code is:

...
categorical_inputs=df[cat_cols]
y=pd.get_dummies(df[target_col]).values
xgboost_labels=df[target_col].values
concatenated_text_input=df['concatenated_text']
text_model.fit(tf.constant(concatenated_text_input),tf.constant(y), epochs=8)
cat_text_generator=(list(categorical_inputs.iloc[i].values)+list(text_model.predict([concatenated_text_input.iloc[i]])[0]) for i in range(len(categorical_inputs)))


clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,\
                       gamma=1)
clf.fit(cat_text_generator, xgboost_labels)

The error I get is:

...
-> 1153         if len(X.shape) != 2:
   1154             # Simply raise an error here since there might be many
   1155             # different ways of reshaping

AttributeError: 'generator' object has no attribute 'shape'

Although I could create a list or array to hold the data, I'd prefer a solution that works when there is too much data to hold in memory at once. Is there a way to train an XGBoost model with a generator?

def generator(X_data, y_data, batch_size):
    while True:
        for step in range(X_data.shape[0] // batch_size):
            start = step * batch_size
            end = (step + 1) * batch_size
            # slice one full batch of rows
            # (for a numpy array, use X_data[start:end] instead)
            current_x = X_data.iloc[start:end]
            current_y = y_data.iloc[start:end]
            yield current_x, current_y

batch_size = 32
Generator = generator(X, y, batch_size)
number_of_steps = X.shape[0] // batch_size

clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1,
                        learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,
                        gamma=1)

booster = None
for step in range(number_of_steps):
    X_g, y_g = next(Generator)
    # pass the previous booster via xgb_model so each call continues
    # training instead of refitting from scratch on the latest batch
    clf.fit(X_g, y_g, xgb_model=booster)
    booster = clf.get_booster()

You can use DeviceQuantileDMatrix with a custom iterator as input. The iterator must implement xgboost.core.DataIter. Here is the example from the xgboost repository:

https://github.com/dmlc/xgboost/blob/master/demo/guide-python/quantile_data_iterator.py