Reference group in PySpark multinomial regression

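Does anyone know what the default reference group is in PySpark multinomial logistic regression? For example, suppose we have a multiclass outcome/target with classes A, B, C, and D.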

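How does Spark choose the reference category? In standard logistic regression in other software (e.g. R, SAS), you can set the reference group yourself. So if your reference is A, you get n-1 models, with the target classes modeled as A vs B, A vs C, and A vs D.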
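Written out, and assuming A is the reference and x is the feature vector, the reference-coded setup I have in mind looks like this (the symbols \beta_{k0} and \beta_k are just my notation for the intercept and coefficient vector of class k, not taken from any particular package):

```latex
\log \frac{P(y = k \mid x)}{P(y = A \mid x)} = \beta_{k0} + \beta_k^{\top} x,
\qquad k \in \{B, C, D\}
```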

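You want to control this choice because if an outcome with only a few observations (a small sample) is set as the reference, the estimates will be unstable.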

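Here is the link to the multinomial logistic regression model in PySpark. The outcome classes there are 0, 1, and 2, but it is not clear what the reference is. I assume it may be zero, but I am not sure.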

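I believe that by default it does not use a reference group. That is why, if you run the snippet from the link, you will find that all of the intercepts are non-zero.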

From the Scala source code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala

Note that there is a difference between multinomial (softmax) and binary loss. The binary case uses one outcome class as a "pivot" and regresses the other class against the pivot. In the multinomial case, the softmax loss function is used to model each class probability independently. Using softmax loss produces K sets of coefficients, while using a pivot class produces K - 1 sets of coefficients (a single coefficient vector in the binary case). In the binary case, we can say that the coefficients are shared between the positive and negative classes...

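It goes on to discuss how the coefficients are in general not identifiable (which is why one would choose a reference label), but that the coefficients do become identifiable when regularization is applied.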
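If you still want coefficients expressed against a reference class of your choosing, one possible approach is to re-express the K softmax coefficient sets after fitting. The sketch below is only an illustration in plain Python: the numbers are made up, and the helper names `softmax_probs` and `to_reference_coding` are hypothetical, not part of Spark. With a fitted PySpark model, the inputs could presumably be taken from `model.coefficientMatrix` and `model.interceptVector`. It relies on the fact that subtracting the same vector from every class's coefficients leaves the softmax probabilities unchanged, which is the non-identifiability mentioned above.

```python
import math

def softmax_probs(coefs, intercepts, x):
    """Class probabilities for one feature vector x under a softmax model."""
    margins = [b + sum(w * xi for w, xi in zip(row, x))
               for row, b in zip(coefs, intercepts)]
    top = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

def to_reference_coding(coefs, intercepts, ref):
    """Subtract the reference class's row so that class `ref` becomes all zeros."""
    ref_row, ref_b = coefs[ref], intercepts[ref]
    new_coefs = [[w - r for w, r in zip(row, ref_row)] for row in coefs]
    new_intercepts = [b - ref_b for b in intercepts]
    return new_coefs, new_intercepts

# Made-up numbers standing in for a fitted 3-class, 2-feature model.
coefs = [[0.5, -1.0], [1.5, 0.25], [-2.0, 0.75]]
intercepts = [0.1, -0.3, 0.2]
x = [1.0, 2.0]

# Re-express everything relative to class 0 and compare predictions.
ref_coefs, ref_intercepts = to_reference_coding(coefs, intercepts, ref=0)
p_softmax = softmax_probs(coefs, intercepts, x)
p_reference = softmax_probs(ref_coefs, ref_intercepts, x)
```

Both parameterizations give the same predicted probabilities, and the reference class's row and intercept become zero. Note that this only re-expresses the already fitted coefficients; I am not claiming it reproduces the estimates or standard errors that R or SAS would report for the same reference class.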

