Extracting values from SparkR summary()
I have a question about the summary() method for random forest regression in SparkR. The model-building step runs fine, but I am interested in featureImportances, one of the algorithm's outputs. I would like to store the featureImportances values in a SparkDataFrame so I can visualize them, but I don't know how to transfer/extract them.
model <- spark.randomForest(x1, x2 , x3, type = "regression", maxDepth = 30, maxBins = 50, numTrees=50, impurity="variance", featureSubsetStrategy="all")
summaryRF <- summary(model)
summaryRF$feature:
1. 'x1'
2. 'x2'
3. 'x3'
summaryRF$featureImportances:
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'
Is there any way to get the featureImportances values out of this object and store them in a SparkDataFrame?
Using the collect() method gives the following error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"character"’
summaryRF is no longer a SparkDataFrame, which is why collect doesn't work :)
summaryRF$featureImportances is a character string (on the Spark side it is a SparseVector, which currently (v. 2.1.0) can't be serialized back and forth to R, which I guess is why it gets coerced to a string).
As far as I know, you have to extract the relevant bits by manipulating the string directly:
# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[","",summaryRF$featureImportances),"\\],\\[")
# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)","",x),","))
# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind,(fimp[[1]])))
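Note that the values in the resulting data frame are still character strings. A minimal self-contained sketch of the same parsing steps, using the example SparseVector string from the question and converting the results to numeric:

```r
# example SparseVector string, as shown in the question's summary() output:
fimpString <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"

# strip the leading "(n,[" and split the index block from the importance block:
fimpList <- strsplit(gsub("\\(.*?\\[", "", fimpString), "\\],\\[")

# split each block into a vector (removing the trailing "])" first):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)", "", x), ","))

# first element holds the feature indexes, second the importances:
featureIndex <- as.numeric(fimp[[1]][[1]])
featureImportance <- as.numeric(fimp[[1]][[2]])
```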
eta: by the way, indexes in Spark start at 0, so if you want to join the feature names in summaryRF$features against the feature indexes from summaryRF$featureImportances, you have to take that into account:
featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = 0:(length(summaryRF$features) - 1),
                                  stringsAsFactors = FALSE)
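Putting the pieces together, here is a hedged end-to-end sketch using the example values from the question (hard-coded here so it runs without a Spark session); the final commented line shows how SparkR's createDataFrame would turn the local data.frame back into a SparkDataFrame for visualization:

```r
# parsed from the example SparseVector string (indexes are 0-based, as in Spark):
featureIndex <- c(0, 1, 2)
featureImportance <- c(0.01324152135, 0.0545454422, 0.0322122334)

# feature names paired with 0-based indexes, as described above:
featureNameAndIndex <- data.frame(featureName = c("x1", "x2", "x3"),
                                  featureIndex = 0:2,
                                  stringsAsFactors = FALSE)

# join names to importances on the shared featureIndex column:
importanceDF <- merge(featureNameAndIndex,
                      data.frame(featureIndex = featureIndex,
                                 featureImportance = featureImportance))

# back into Spark, if you need a SparkDataFrame:
# importanceSDF <- createDataFrame(importanceDF)
```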