Spark mllib - NaiveBayes weightCol parameter: effect and format

Question

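I am currently trying several algorithms on my data to determine which one works best. I am also looking into how to tune those algorithms, using a CrossValidator object to test the parameters.

I am stuck on NaiveBayes and its weightCol parameter.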

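I can't find any information about it: how it works or how to set it. The comment in the code says "If this is not set or empty, we treat all instance weights as 1.0", so I thought I could use a value like "mycolumn=1.0,myothercol=2.0", but whatever I try, I always get an error in return.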

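The only case where I get no error is when I use "mycolumn" as the value, but I don't know what effect it has.

If anyone knows how to use this parameter, I would appreciate the help.

Thanks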

Answer 1

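weightCol should be the name of a column of type Double. The value in that column determines how important each sample is, for example to correct for a skewed label distribution.

Suppose you have data like this: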
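To make the expected format concrete, here is a minimal sketch, assuming a local SparkSession and a tiny made-up dataset (the column name "weight" and the toy rows are hypothetical, chosen only for illustration). It shows that setWeightCol receives a column name, not a "column=value" expression, and that a constant weight of 1.0 should correspond to the default behaviour quoted in the question.

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("weightcol-sketch")
  .getOrCreate()

// Hypothetical toy data: two rows, non-negative features (as NaiveBayes expects).
val toy = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0))
)).toDF("label", "features")

// The weight lives in an ordinary Double column, one value per row.
val toyWithWeight = toy.withColumn("weight", lit(1.0))

// weightCol is just the NAME of that column.
val nb = new NaiveBayes().setWeightCol("weight")
val toyModel = nb.fit(toyWithWeight)
```

Different weights per row are then a matter of how the column is computed, not of the string passed to setWeightCol.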

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val skewed = data
  .where($"label" === 0.0).limit(5)
  .union(data.where($"label" === 1.0))

skewed.groupBy($"label").count.show

+-----+-----+
|label|count|
+-----+-----+
|  0.0|    5|
|  1.0|   57|
+-----+-----+

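We can assign a higher weight to records whose label is 0.0:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.functions.when

// Up-weight the rare class: each label 0.0 row counts 10x as much as a label 1.0 row
val weighted = skewed
  .withColumn("weight", when($"label" === 0.0, 1.0).otherwise(0.1))

val weightedModel = new NaiveBayes().setWeightCol("weight").fit(weighted)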

weightedModel.transform(weighted.where($"label" === 0.0)).show

+-----+--------------------+------+--------------------+-----------+----------+
|label|            features|weight|       rawPrediction|probability|prediction|
+-----+--------------------+------+--------------------+-----------+----------+
|  0.0|(692,[127,128,129...|   1.0|[-165013.81130787...|  [1.0,0.0]|       0.0|
|  0.0|(692,[129,130,131...|   1.0|[-191959.02863649...|  [1.0,0.0]|       0.0|
|  0.0|(692,[154,155,156...|   1.0|[-201850.30335886...|  [1.0,0.0]|       0.0|
|  0.0|(692,[127,128,129...|   1.0|[-202315.73236242...|  [1.0,0.0]|       0.0|
|  0.0|(692,[153,154,155...|   1.0|[-258710.53340756...|  [1.0,0.0]|       0.0|
+-----+--------------------+------+--------------------+-----------+----------+
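To see roughly why these weights help, the following plain-Scala sketch recomputes the class priors by hand. It assumes additive (Laplace) smoothing with lambda = 1.0 applied to per-class weight sums, which is my understanding of how the multinomial model treats instance weights; the helper name smoothedPriors is hypothetical and not part of Spark.

```scala
// Sketch only: smoothed class priors from per-class weight sums.
// Assumes prior(c) = (weightSum(c) + lambda) / (totalWeight + numClasses * lambda).
def smoothedPriors(weightSums: Map[Double, Double], lambda: Double = 1.0): Map[Double, Double] = {
  val total = weightSums.values.sum
  val numClasses = weightSums.size
  weightSums.map { case (label, w) =>
    label -> (w + lambda) / (total + numClasses * lambda)
  }
}

// Every row has weight 1.0: 5 rows of label 0.0, 57 rows of label 1.0.
val unweightedPriors = smoothedPriors(Map(0.0 -> 5 * 1.0, 1.0 -> 57 * 1.0))

// Label 0.0 rows keep weight 1.0, label 1.0 rows get weight 0.1.
val weightedPriors = smoothedPriors(Map(0.0 -> 5 * 1.0, 1.0 -> 57 * 0.1))

println(s"unweighted: $unweightedPriors")
println(s"weighted:   $weightedPriors")
```

Under these assumptions the prior for label 0.0 rises from about 0.09 to about 0.47, so the rare class is no longer dominated by the frequent one.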

To scale the feature vectors themselves, you can use ElementwiseProduct.
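Feature scaling is a separate operation from instance weighting: it multiplies each feature dimension by a fixed factor instead of changing how much a row counts. A minimal sketch, assuming a local SparkSession and made-up three-dimensional vectors (the column names and the scaling vector are hypothetical):

```scala
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("elementwise-sketch")
  .getOrCreate()

// Hypothetical input vectors.
val vectors = spark.createDataFrame(Seq(
  ("a", Vectors.dense(1.0, 2.0, 3.0)),
  ("b", Vectors.dense(4.0, 5.0, 6.0))
)).toDF("id", "features")

// Each feature vector is multiplied component-wise by scalingVec.
val scaler = new ElementwiseProduct()
  .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val scaled = scaler.transform(vectors)
scaled.show()
```

The scaling vector must have the same length as the feature vectors, so for the libsvm sample data above it would need 692 entries.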


apache-spark

apache-spark-mllib