SparkR summary() extracting

I have a question about the summary() method for a random forest regression in SparkR. The model building runs fine, but I'm interested in the featureImportances from the algorithm's output. I'd like to store the featureImportances in a SparkDataFrame so I can visualize them, but I don't know how to transfer/extract them.

model <- spark.randomForest(df, y ~ x1 + x2 + x3, type = "regression", maxDepth = 30,
                            maxBins = 50, numTrees = 50, impurity = "variance",
                            featureSubsetStrategy = "all")

summaryRF <- summary(model)

summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'

summaryRF$featureImportances: 
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'

Is there any way to get the featureImportances values out of the list object and store them in a SparkDataFrame?

Using the collect() method gives the following error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"character"’

summaryRF is no longer a SparkDataFrame, which is why collect doesn't work :)

summaryRF$featureImportances is a character string (on the Spark side it is a SparseVector, which currently (v. 2.1.0) can't be serialized back and forth to R, which I suppose is why it gets coerced to a string).

As far as I know, you have to extract the relevant bits by manipulating the string directly:

# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[", "", summaryRF$featureImportances), "\\],\\[")

# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))

# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind, fimp[[1]]))
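You can check the same parsing without a Spark session. This is a minimal sketch in plain R, assuming the importance string has exactly the "(size,[indices],[values])" shape shown in the question (the column names featureIndex/importance are my own choice):

```r
# example importance string in SparseVector text form: (size,[indices],[values])
s <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"

# drop the "(size,[" prefix and the "])" suffix, then split on "],["
parts <- strsplit(gsub("\\]\\)$", "", gsub("^\\(.*?\\[", "", s)), "\\],\\[")[[1]]

# first part holds the indexes, second part the importance values
fimpDF <- data.frame(featureIndex = as.integer(strsplit(parts[1], ",")[[1]]),
                     importance   = as.numeric(strsplit(parts[2], ",")[[1]]))
```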

eta: by the way, indexing in Spark starts at 0, so if you want to join the feature indexes from summaryRF$featureImportances against the feature names in summaryRF$features, you have to take that into account:

featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = 0:(length(summaryRF$features) - 1),
                                  stringsAsFactors = FALSE)
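Putting the pieces together, here is a plain-R sketch of the join, using made-up values matching the example output above (the fimpDF/importance names are illustrative, not part of SparkR):

```r
# feature names paired with 0-based indexes, as built above
featureNameAndIndex <- data.frame(featureName = c("x1", "x2", "x3"),
                                  featureIndex = 0:2,
                                  stringsAsFactors = FALSE)

# importances keyed by the same 0-based index (values from the example string)
fimpDF <- data.frame(featureIndex = 0:2,
                     importance = c(0.01324152135, 0.0545454422, 0.0322122334))

# join names to importances on the shared index
fullDF <- merge(featureNameAndIndex, fimpDF, by = "featureIndex")
```

From there, createDataFrame(fullDF) would turn the result back into a SparkDataFrame if you need one for visualization.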