Manual prediction with LogisticRegressionModel
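I'm trying to predict the label for each row of a DataFrame without using the LogisticRegressionModel's transform method. For reasons of my own, I'm computing it manually with the classic formula 1 / (1 + e^(-hθ(x))). Note that I copied the code from Apache Spark's repository, turning almost everything in the private object BLAS into a public version of it.
PS: I'm not using any regParam; I'm just fitting the model.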
// Note that I had to obtain the intercept and coefficients from my model
val intercept = model.intercept
val coefficients = model.coefficients

val margin: Vector => Double = (features) => {
  BLAS.dot(features, coefficients) + intercept
}

val score: Vector => Double = (features) => {
  val m = margin(features)
  1.0 / (1.0 + math.exp(-m))
}
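After defining these functions and obtaining the model's parameters, I created a UDF to compute the prediction (it receives the features as a DenseVector). I then compared my predictions with those of the real model, and they are very different! So what am I missing? What am I doing wrong?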
import org.apache.spark.sql.functions.{col, udf}

val predict = udf((v: DenseVector) => {
  val recency = v(0)
  val frequency = v(1)
  val tp = score(new DenseVector(Array(recency, frequency)))
  new DenseVector(Array(tp, 1 - tp))
})

// model's predictions
val xf = model.transform(df)

df.select(col("id"), predict(col("features")).as("myprediction"))
  .join(xf, df("id") === xf("id"), "inner")
  .select(df("id"), col("probability"), col("myprediction"))
  .show
+----+--------------------+--------------------+
| id| probability| myprediction|
+----+--------------------+--------------------+
| 31|[0.97579780436514...|[0.98855386037790...|
| 231|[0.97579780436514...|[0.98855386037790...|
| 431|[0.69794428333266...| [1.0,0.0]|
| 631|[0.97579780436514...|[0.98855386037790...|
| 831|[0.97579780436514...|[0.98855386037790...|
|1031|[0.96509616791398...|[0.99917463322937...|
|1231|[0.96509616791398...|[0.99917463322937...|
|1431|[0.96509616791398...|[0.99917463322937...|
|1631|[0.94231815700848...|[0.99999999999999...|
|1831|[0.96509616791398...|[0.99917463322937...|
|2031|[0.96509616791398...|[0.99917463322937...|
|2231|[0.96509616791398...|[0.99917463322937...|
|2431|[0.95353743438055...| [1.0,0.0]|
|2631|[0.94646924057674...| [1.0,0.0]|
|2831|[0.96509616791398...|[0.99917463322937...|
|3031|[0.96509616791398...|[0.99917463322937...|
|3231|[0.95971207153567...|[0.99999999999996...|
|3431|[0.96509616791398...|[0.99917463322937...|
|3631|[0.96509616791398...|[0.99917463322937...|
|3831|[0.96509616791398...|[0.99917463322937...|
+----+--------------------+--------------------+
Edit
I even tried defining these functions inside the udf, but without success.
def predict(coefficients: Vector, intercept: Double) = {
  udf((v: DenseVector) => {
    def margin(features: Vector, coefficients: Vector, intercept: Double): Double = {
      BLAS.dot(features, coefficients) + intercept
    }
    def score(features: Vector, coefficients: Vector, intercept: Double): Double = {
      val m = margin(features, coefficients, intercept)
      1.0 / (1.0 + math.exp(-m))
    }
    val recency = v(0)
    val frequency = v(1)
    val tp = score(new DenseVector(Array(recency, frequency)), coefficients, intercept)
    new DenseVector(Array(tp, 1 - tp))
  })
}
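This is embarrassing, but the problem was that I used a Pipeline with a MinMaxScaler added as a stage, so the dataset was scaled before the model was trained. Both parameters, coefficients and intercept, were therefore tied to the scaled data, and when I used them to compute predictions on the unscaled features the results were completely off. To solve this, I simply un-normalized the training dataset, so that the coefficients and intercept I obtained correspond to the raw features. After re-running the code, I got the same results as Spark. On a separate note, I followed @zero323's advice and moved the margin and score definitions inside the udf declaration.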
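To illustrate the mismatch described above, here is a minimal plain-Scala sketch with no Spark dependency. It assumes min-max scaling to [0, 1] with max > min for every feature; all numbers and helper names (ScaledLogisticDemo, minMaxScale, toRawSpace) are hypothetical and are not taken from the question's model. It shows that parameters fitted on scaled features give a sensible probability on scaled input, saturate to 1.0 on raw input (as in the [1.0,0.0] rows of the table), and can alternatively be folded back into the raw feature space.

```scala
object ScaledLogisticDemo {
  // Logistic function applied to the margin w . x + b.
  def sigmoid(m: Double): Double = 1.0 / (1.0 + math.exp(-m))

  // Plain dot product, standing in for BLAS.dot.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Per-feature min-max scaling to [0, 1]; assumes max(i) > min(i).
  def minMaxScale(x: Array[Double], min: Array[Double], max: Array[Double]): Array[Double] =
    x.indices.map(i => (x(i) - min(i)) / (max(i) - min(i))).toArray

  // Fold the scaling into the parameters so they apply to raw features:
  // w . scale(x) + b == rawW . x + rawB
  def toRawSpace(w: Array[Double], b: Double,
                 min: Array[Double], max: Array[Double]): (Array[Double], Double) = {
    val rawW = w.indices.map(i => w(i) / (max(i) - min(i))).toArray
    val rawB = b - w.indices.map(i => w(i) * min(i) / (max(i) - min(i))).sum
    (rawW, rawB)
  }

  // Hypothetical parameters, as if fitted on scaled (recency, frequency).
  val scaledW: Array[Double] = Array(2.0, 3.0)
  val scaledB: Double = 0.5
  // Hypothetical per-feature minimum and maximum seen by the scaler.
  val min: Array[Double] = Array(0.0, 1.0)
  val max: Array[Double] = Array(400.0, 51.0)
  // One hypothetical raw (unscaled) feature vector.
  val raw: Array[Double] = Array(100.0, 26.0)

  def main(args: Array[String]): Unit = {
    // Mismatch: scaled-space parameters applied to raw features.
    val mismatched = sigmoid(dot(raw, scaledW) + scaledB)
    // Consistent: scale the features first, then apply the parameters.
    val onScaled = sigmoid(dot(minMaxScale(raw, min, max), scaledW) + scaledB)
    // Equivalent: convert the parameters and apply them to raw features.
    val (rawW, rawB) = toRawSpace(scaledW, scaledB, min, max)
    val onRaw = sigmoid(dot(raw, rawW) + rawB)

    println(s"mismatched = $mismatched")
    println(s"on scaled  = $onScaled")
    println(s"on raw     = $onRaw")
  }
}
```

In a real pipeline, the per-feature minimum and maximum would have to come from the fitted scaler stage (in Spark ML, MinMaxScalerModel exposes originalMin and originalMax). Whether to scale the features inside the UDF or to convert the parameters once up front is a design choice; both should give the same margin up to floating-point error.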