pyspark, logistic regression, how to get the coefficients of the respective features
I am new to Spark, and my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example from Spark Python MLlib:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
And I found that the attributes of model are:
In [21]: model.<TAB>
model.clearThreshold model.predict model.weights
model.intercept model.setThreshold
How can I get the coefficients of this logistic regression?
As you can see, the way to get the coefficients is to use the attributes of LogisticRegressionModel (see the short sketch after the parameter list below).
Parameters:
weights – Weights computed for every feature.
intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)
numFeatures – the dimension of the features.
numClasses – the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so numClasses will be set to 2.
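A minimal sketch of reading those attributes from the model trained in the question (the exact values depend on your data):

# The trained MLlib model exposes the fitted parameters as attributes.
print(model.weights)      # DenseVector of per-feature coefficients [theta_1, ..., theta_n]
print(model.intercept)    # bias term theta_0 (0.0 here unless you train with intercept=True)

# Pair each coefficient with its feature index for easier inspection
print(list(enumerate(model.weights.toArray())))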
Don't forget that hθ(x) = 1 / (1 + e^-(θ0 + θ1*x1 + ... + θn*xn)), where θ0 is the intercept, [θ1, ..., θn] are the weights, and n is the number of features.
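As a quick sanity check of that formula against the trained model (a sketch only; it reuses one point of parsedData from the question and NumPy, which is not otherwise required):

import numpy as np

x = parsedData.first().features.toArray()                       # one training point, as a NumPy array
margin = model.intercept + np.dot(model.weights.toArray(), x)   # theta_0 + theta_1*x1 + ... + theta_n*xn
prob = 1.0 / (1.0 + np.exp(-margin))                            # h_theta(x)
print(prob)   # model.predict(x) applies the default 0.5 threshold to this probability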
Edit
This is how the prediction is made; you can check the source of LogisticRegressionModel:
def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class
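Note that in the binary branch above, predict returns the probability itself when no threshold is set, so you can also read h_theta(x) directly from the model:

model.clearThreshold()                               # binary case: predict now returns the probability
print(model.predict(parsedData.first().features))    # value in (0, 1), i.e. h_theta(x)
model.setThreshold(0.5)                              # restore the default 0/1 behaviour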
I am using model.coefficients and it works!
Documentation:
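For reference, coefficients is the attribute name used by the DataFrame-based pyspark.ml API, rather than the pyspark.mllib API shown above. A minimal sketch, assuming a DataFrame df with "label" and "features" columns (df and the maxIter setting are hypothetical, not from the original post):

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10)      # hypothetical settings
lr_model = lr.fit(df)                    # df is an assumed DataFrame with "label"/"features" columns

print(lr_model.coefficients)             # per-feature coefficients (a DenseVector)
print(lr_model.intercept)                # intercept term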