PySpark, logistic regression: how to get the coefficients of the respective features

I am new to Spark, and my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example in the Spark Python MLlib documentation:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Compare label and prediction by index (tuple-unpacking lambdas are Python 2 only)
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

And I found that the attributes of model are:

In [21]: model.<TAB>
model.clearThreshold  model.predict         model.weights
model.intercept       model.setThreshold  

How can I get the coefficients of the logistic regression?

As you noticed, the way to obtain the coefficients is through the attributes of LogisticRegressionModel.

Parameters:

weights – Weights computed for every feature.

intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)

numFeatures – the dimension of the features.

numClasses – the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so numClasses will be set to 2.
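A minimal sketch of reading those attributes for a binary model such as the one in the question. The helper name `describe_coefficients` and the feature names are hypothetical, and a small stand-in object replaces the trained model so the snippet runs without Spark; with a real model you would pass `model` itself.

```python
from types import SimpleNamespace


def describe_coefficients(model, feature_names):
    """Pair each feature name with its weight and report the intercept.

    Hypothetical helper: it relies only on the `weights` and `intercept`
    attributes listed above.
    """
    weights = model.weights
    # MLlib vectors expose toArray(); plain lists are accepted as well.
    values = weights.toArray() if hasattr(weights, "toArray") else weights
    values = [float(v) for v in values]
    if len(values) != len(feature_names):
        raise ValueError("expected %d feature names, got %d"
                         % (len(values), len(feature_names)))
    return {"intercept": float(model.intercept),
            "weights": dict(zip(feature_names, values))}


# Stand-in for a trained model; the numbers are made up for illustration.
fake_model = SimpleNamespace(weights=[0.5, -1.25], intercept=0.1)
summary = describe_coefficients(fake_model, ["x1", "x2"])
print(summary)
```

The order of the weights follows the order of the values passed to LabeledPoint, so the feature names must be listed in that same order.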

Remember that hθ(x) = 1 / (1 + exp(-(θ0 + θ1 * x1 + ... + θn * xn))), where θ0 is the intercept, [θ1, ..., θn] are the weights, and n is the number of features.
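As a quick check of the logistic function, here is a small sketch in plain Python (no Spark needed). The function name `hypothesis` and the sample numbers are made up for illustration; θ0 plays the role of `model.intercept` and [θ1, ..., θn] the role of `model.weights`.

```python
import math


def hypothesis(intercept, weights, features):
    """Logistic function applied to theta0 + theta1*x1 + ... + thetan*xn."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-margin))


# Made-up values: theta0 = 0.0, weights = [2.0, -1.0], x = [1.0, 2.0].
# The margin is 0.0 + 2.0*1.0 + (-1.0)*2.0 = 0.0, so the probability is 0.5.
p = hypothesis(0.0, [2.0, -1.0], [1.0, 2.0])
print(p)
```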

Edit

This is how the prediction is made; you can check it in the source of LogisticRegressionModel.

def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class
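The binary branch above can be mirrored in plain Python to see the two details it handles: the sigmoid is computed in a form that does not overflow for large negative margins, and the result is a raw probability when the threshold is None or a 0/1 label otherwise. This is a sketch of that logic, not the library code; `predict_binary` is a hypothetical name and the numbers are made up.

```python
import math


def predict_binary(weights, intercept, features, threshold=0.5):
    """Mimic the numClasses == 2 branch for plain Python lists."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    if margin > 0:
        # exp(-margin) is at most 1 here, so it cannot overflow.
        prob = 1.0 / (1.0 + math.exp(-margin))
    else:
        # exp(margin) is at most 1 here, so it cannot overflow either.
        exp_margin = math.exp(margin)
        prob = exp_margin / (1.0 + exp_margin)
    if threshold is None:
        return prob
    return 1 if prob > threshold else 0


weights, intercept = [2.0, -1.0], 0.5  # made-up coefficients

# margin = 2.0 - 1.0 + 0.5 = 1.5, which gives a probability above 0.5
label = predict_binary(weights, intercept, [1.0, 1.0])
prob = predict_binary(weights, intercept, [1.0, 1.0], threshold=None)

# A very negative margin: the naive form 1 / (1 + exp(-margin)) would
# raise OverflowError in math.exp, while the branch above returns 0.0.
extreme = predict_binary(weights, intercept, [-1000.0, 0.0], threshold=None)
print(label, prob, extreme)
```

On a real model, the None case corresponds to calling `model.clearThreshold()`, which is listed among the attributes in the question, so `model.predict` then returns probabilities instead of labels.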

I am using the following, which is an attribute of the pyspark.ml LogisticRegressionModel:

model.coefficients

and it works!

Documentation:

https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html?highlight=coefficients#pyspark.ml.classification.LogisticRegressionModel.coefficients