SparkR glm model
I want to use glm on SparkR.
This is the example from the official Spark documentation:
df <- createDataFrame(sqlContext, iris)
# Fit a gaussian GLM model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
# Model summary are returned in a similar format to R's native glm().
summary(model)
##$devianceResiduals
## Min       Max
## -1.307112 1.412532
##
##$coefficients
##                   Estimate  Std. Error t value  Pr(>|t|)
##(Intercept)        2.251393  0.3697543  6.08889  9.568102e-09
##Sepal_Width        0.8035609 0.106339   7.556598 4.187317e-12
##Species_versicolor 1.458743  0.1121079  13.01195 0
##Species_virginica  1.946817  0.100015   19.46525 0
# Make predictions based on the model.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
##  Sepal_Length prediction
##1          5.1   5.063856
##2          4.9   4.662076
##3          4.7   4.822788
##4          4.6   4.742432
##5          5.0   5.144212
##6          5.4   5.385281
When I call summary on the model, I get this:
summary(model)
Length         Class Mode
     1 PipelineModel   S4
model
An object of class "PipelineModel"
Slot "model":
Java ref type org.apache.spark.ml.PipelineModel id 188
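What does this mean? How can I see the summary results shown in the example?
Secondly, I tried another example, from Databricks, on a different dataset.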
training<-createDataFrame(sqlContext,training)
#|-- LINESET: string (nullable = true)
#|-- TIMEINTERVAL: integer (nullable = true)
#|-- SmsIn: double (nullable = true)
#|-- SmsOut: double (nullable = true)
#|-- CallIn: double (nullable = true)
#|-- CallOut: double (nullable = true)
#|-- Internet: double (nullable = true)
#|-- ValueAmp: double (nullable = true)
model <- glm(ValueAmp ~ TIMEINTERVAL + LINESET,
family = "gaussian", data =training)
summary(model)
preds<- predict(model,training)
errors <- select(
preds, preds$label, preds$prediction, preds$LINESET,
alias(preds$label - preds$prediction, "error"))
display(sample(errors, F, .0001))
Errore in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘display’ for signature ‘"DataFrame"’
How can I solve the display error? Why does it work on Databricks?
How can I see those results from summary that are explained in the example?
An R-like summary was introduced in Spark 1.6.0 (see SPARK-11473). It looks like you are using an earlier version.
How can I solve the display error? Why does it work on Databricks?
It does not work locally because display is not a Spark / SparkR function but a proprietary feature of the Databricks platform.
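Outside Databricks, one way to inspect the sampled rows is with functions that SparkR itself provides, such as showDF, head and collect. The following is only a minimal sketch. It assumes a local Spark 1.x session created with sparkR.init and sparkRSQL.init, and it uses the iris data as a stand-in for the errors DataFrame from the question:

```r
library(SparkR)

# Assumption: a local Spark 1.x installation; adjust `master` as needed.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Stand-in for the `errors` DataFrame built in the question.
df <- createDataFrame(sqlContext, iris)

# Sample without replacement, as in the question.
# A larger fraction is used here because iris is small.
sampled <- sample(df, FALSE, 0.1)

# Print the rows to the console; a plain-text substitute for display().
showDF(sampled)

# Or pull the sample into a local R data.frame and use ordinary R tools on it.
local_sample <- collect(sampled)
head(local_sample)

sparkR.stop()
```

collect brings every row of its argument to the driver, so it is safest to call it only on a sample or an otherwise small DataFrame.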