Why does LogisticRegressionModel fail at scoring libsvm data?

Question

Load the data that you want to score. The data is stored in libsvm format as follows: label index1:value1 index2:value2 ... (the indices are one-based and in ascending order). Here is a sample line:
100 10:1 11:1 208:1 400:1 1830:1

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val unseendata: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, unseendatafileName)
val scores_path = results_base + run_id + "/" + "-scores"

// Load the saved model
val lrm = LogisticRegressionModel.load(sc, "logisticregressionmodels/mymodel")

// I had saved the model after training using the save method. Here is the metadata
// for that model, from LogisticRegressionModel/mymodel/metadata/part-00000:
// {"class":"org.apache.spark.mllib.classification.LogisticRegressionModel","version":"1.0","numFeatures":176894,"numClasses":2}

// Evaluate the model on unseen data
val valuesAndPreds = unseendata.map { point =>
  val prediction = lrm.predict(point.features)
  (point.label, prediction)
}

// Store the scores
valuesAndPreds.saveAsTextFile(scores_path)

This is the error message I get:

16/04/28 10:22:07 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 5, ): java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:221)
    at org.apache.spark.mllib.classification.LogisticRegressionModel.predictPoint(LogisticRegression.scala:105)
    at org.apache.spark.mllib.regression.GeneralizedLinearModel.predict(GeneralizedLinearAlgorithm.scala:76)

Answer 1

The code that throws the exception is require(dataMatrix.size == numFeatures).

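My guess is that the model was fit with 176894 features (see "numFeatures":176894 in the model metadata), while the vectors loaded from the libsvm file have only 1830 features. When loadLibSVMFile is not given a feature count, it infers one from the largest index that appears in the file. The two numbers must match.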
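To make the mismatch concrete, here is a minimal sketch in plain Scala (no Spark needed) of how a feature count gets inferred from the largest index when none is supplied. parseLibSVMLine and inferNumFeatures are hypothetical helpers written for illustration, not Spark APIs, and the sketch assumes the sample line from the question is representative of the whole file.

```scala
// Hypothetical helper, not part of Spark: parses one libsvm line into
// (label, zero-based indices, values).
def parseLibSVMLine(line: String): (Double, Array[Int], Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val (indices, values) = tokens.tail.map { item =>
    val Array(index, value) = item.split(":")
    (index.toInt - 1, value.toDouble) // libsvm indices are one-based
  }.unzip
  (label, indices, values)
}

// Hypothetical helper: with no explicit feature count, the dimension is
// taken from the largest index seen in the data (indices are ascending,
// so the last index on each line is that line's largest).
def inferNumFeatures(lines: Seq[String]): Int =
  lines.map(line => parseLibSVMLine(line)._2.lastOption.getOrElse(-1)).max + 1

val sample = Seq("100 10:1 11:1 208:1 400:1 1830:1")
val inferred = inferNumFeatures(sample)
val expected = 176894 // numFeatures from the model metadata in the question

println(s"inferred=$inferred, model expects=$expected, match=${inferred == expected}")
// prints "inferred=1830, model expects=176894, match=false"
```

Since 1830 is not 176894, the size check in predictPoint fails for every point.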

Change the line that loads the libsvm file to:

val unseendata = MLUtils.loadLibSVMFile(sc, unseendatafileName, 176894)
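If you would rather not hard-code 176894, one option is to read the dimension from the loaded model, which exposes it as numFeatures. The following is only a sketch under that assumption: loadForModel and score are hypothetical helper names, and the paths are whatever you already use.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Hypothetical helper: loads libsvm data using the dimension the model
// was trained with, so the feature vectors always match the model.
def loadForModel(sc: SparkContext, path: String,
                 model: LogisticRegressionModel): RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, path, model.numFeatures)

// Hypothetical helper: loads the model, then scores the data as
// (label, prediction) pairs, as in the question.
def score(sc: SparkContext, dataPath: String,
          modelPath: String): RDD[(Double, Double)] = {
  val model = LogisticRegressionModel.load(sc, modelPath)
  val data = loadForModel(sc, dataPath, model)
  data.map(point => (point.label, model.predict(point.features)))
}
```

With this, score(sc, unseendatafileName, "logisticregressionmodels/mymodel").saveAsTextFile(scores_path) would replace the load, predict and save steps from the question.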

apache-spark

apache-spark-ml

apache-spark-mllib