XGBoost4j-spark prediction on sparse vector from local model

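I'm running on Databricks. I'm trying to make distributed predictions in Scala with xgboost4j-spark, using an xgboost model that was trained locally in R. The data is in a DataFrame with a feature column of sparse vectors built with org.apache.spark.ml.linalg.Vectors.sparse. I have already successfully trained an unrelated model on data in this format.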

The data looks like this:

train_sparse.filter("ID == 1").show(false)
+-----------+------------------------------------------+
|ID         |feature_vector                            |
+-----------+------------------------------------------+
|1          |(4056,[0,1,1097,2250],[26.0,1.0,1.0,57.0])|
+-----------+------------------------------------------+

A bridge class has to be created first in order to load the local model.

%scala
package ml.dmlc.xgboost4j.scala.spark2
import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel
class XGBoostRegBridge(
    uid: String,
    _booster: Booster) {
  val xgbRegressionModel = new XGBoostRegressionModel(uid, _booster)
}

import ml.dmlc.xgboost4j.scala.spark2._
import ml.dmlc.xgboost4j.scala.XGBoost
val model = XGBoost.loadModel("/dbfs/FileStore/tmp/xgb53.model")
val bri = new XGBoostRegBridge("uid", model)
bri.xgbRegressionModel.setFeaturesCol("feature_vector")
val pred = bri.xgbRegressionModel.transform(train_sparse)
pred.show()

Job aborted due to stage failure.
Caused by: XGBoostError: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred(): 
Stack trace:
  [bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
  [bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xd10) [0x7f0ff880d960]
  [bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
  [bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
  [bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
  [bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
  [bt] (6) [0x7f1020017ee7]


Stack trace:
  [bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
  [bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xdc4) [0x7f0ff880da14]
  [bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
  [bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
  [bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
  [bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
  [bt] (6) [0x7f1020017ee7]

This looks like some kind of iterator error, but I'm not using a custom iterator.

It turned out that all I needed was bri.xgbRegressionModel.setMissing(0.0F). It works now.
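
For reference, here is a minimal sketch of the prediction cell with that one call added, reusing the names from the question (XGBoostRegBridge, train_sparse, and the model path). My understanding, which I have not checked against the xgboost4j-spark source for this version, is that the missing value defaults to NaN and that xgboost4j-spark refuses sparse vectors unless the missing value is 0.0, because entries absent from a sparse vector are implicitly zero. The refusal is raised on the JVM side while the native code is pulling rows through the data iterator, which would explain why it surfaces as an opaque iterator error.

```scala
import ml.dmlc.xgboost4j.scala.spark2._
import ml.dmlc.xgboost4j.scala.XGBoost

// Load the booster that was trained locally in R and wrap it via the bridge class.
val model = XGBoost.loadModel("/dbfs/FileStore/tmp/xgb53.model")
val bri = new XGBoostRegBridge("uid", model)

bri.xgbRegressionModel.setFeaturesCol("feature_vector")
// Declare 0.0 as the missing value so it agrees with the implicit zeros
// of the sparse feature vectors (assumption: this is what the check requires).
bri.xgbRegressionModel.setMissing(0.0F)

val pred = bri.xgbRegressionModel.transform(train_sparse)
pred.show()
```

Note that with this setting any explicit 0.0 stored in a feature vector is also treated as missing, so it is worth confirming that the predictions match what the R model returns for the same rows.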