SparkR summary() extracting

I have a question about the summary() method for a random forest regression in SparkR. The model building runs fine, but I'm interested in the featureImportances from the algorithm's output. I'd like to store the featureImportances in a SparkDataFrame so I can visualize them, but I don't know how to transfer/extract them.

model <- spark.randomForest(df, y ~ x1 + x2 + x3, type = "regression", maxDepth = 30,
                            maxBins = 50, numTrees = 50, impurity = "variance",
                            featureSubsetStrategy = "all")

summaryRF <- summary(model)

summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'

summaryRF$featureImportances: 
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'

Is there any way to get the featureImportances values out of the list object and store them in a SparkDataFrame?

Using the collect() method gives the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"character"’

summaryRF is no longer a SparkDataFrame, which is why collect doesn't work :)

summaryRF$featureImportances is a character string (on the Spark side it is a SparseVector, which currently (v. 2.1.0) can't be serialized back and forth to R, which I suppose is why it gets coerced to a string).

As far as I know, you have to extract the relevant bits by manipulating the string directly:

# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[", "", summaryRF$featureImportances), "\\],\\[")

# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))

# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind, fimp[[1]]))
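You can check the same parsing without a Spark session. This is a minimal sketch in plain R, assuming the importance string has exactly the "(size,[indices],[values])" shape shown in the question (the column names featureIndex/importance are my own choice):

```r
# example importance string in SparseVector text form: (size,[indices],[values])
s <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"

# drop the "(size,[" prefix and the "])" suffix, then split on "],["
parts <- strsplit(gsub("\\]\\)$", "", gsub("^\\(.*?\\[", "", s)), "\\],\\[")[[1]]

# first part holds the indexes, second part the importance values
fimpDF <- data.frame(featureIndex = as.integer(strsplit(parts[1], ",")[[1]]),
                     importance   = as.numeric(strsplit(parts[2], ",")[[1]]))
```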

eta: by the way, indexing in Spark starts at 0, so if you want to join the feature indexes from summaryRF$featureImportances against the feature names in summaryRF$features, you have to take that into account:

featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = 0:(length(summaryRF$features) - 1),
                                  stringsAsFactors = FALSE)
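Putting the pieces together, here is a plain-R sketch of the join, using made-up values matching the example output above (the fimpDF/importance names are illustrative, not part of SparkR):

```r
# feature names paired with 0-based indexes, as built above
featureNameAndIndex <- data.frame(featureName = c("x1", "x2", "x3"),
                                  featureIndex = 0:2,
                                  stringsAsFactors = FALSE)

# importances keyed by the same 0-based index (values from the example string)
fimpDF <- data.frame(featureIndex = 0:2,
                     importance = c(0.01324152135, 0.0545454422, 0.0322122334))

# join names to importances on the shared index
fullDF <- merge(featureNameAndIndex, fimpDF, by = "featureIndex")
```

From there, createDataFrame(fullDF) would turn the result back into a SparkDataFrame if you need one for visualization.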