SparkR. How to count distinct values for all columns in a Spark DataFrame?

Question

Is there a way to count the number of distinct items in each column of a Spark DataFrame? That is, given this dataset:

set.seed(123)
df <- data.frame(ColA = rep(c("dog", "cat", "fish", "shark"), 4), ColB = rnorm(16), ColC = rep(1:8, 2))
df

In plain R, I do this to get the counts:

sapply(df, function(x){length(unique(x))} )

> ColA ColB ColC 
   4   16    8

How would I do the same thing for this Spark DataFrame?

sdf <- SparkR::createDataFrame(df)

Any help is greatly appreciated. Thanks in advance. -nate

Answer 1

This works for me in SparkR:

# build one aliased countDistinct() expression per column
exprs <- lapply(names(sdf), function(x) alias(countDistinct(sdf[[x]]), x))
# use do.call to splice the aggregation expressions into the agg function
head(do.call(agg, c(x = sdf, exprs)))

#  ColA ColB ColC
#1    4   16    8
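
If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name count_distinct_all is made up for illustration, the session is started locally with master = "local[*]", and the data frame is rebuilt from the question so the snippet runs on its own. It wraps sdf in list() before splicing, which is a slightly more explicit way to build the argument list for do.call.

```r
library(SparkR)

# Start (or reuse) a Spark session. The local master URL is an assumption;
# adjust it for your cluster.
sparkR.session(master = "local[*]")

# Hypothetical helper (not part of SparkR): returns a one-row local
# data.frame holding the distinct count of every column in `sdf`.
count_distinct_all <- function(sdf) {
  # One aliased countDistinct() expression per column, so each result
  # column keeps the name of the column it was computed from.
  exprs <- lapply(columns(sdf), function(col_name) {
    alias(countDistinct(sdf[[col_name]]), col_name)
  })
  # Splice the expressions into a single agg() call, then collect the
  # one-row result back into R.
  collect(do.call(agg, c(list(x = sdf), exprs)))
}

# Same data as in the question.
set.seed(123)
df <- data.frame(
  ColA = rep(c("dog", "cat", "fish", "shark"), 4),
  ColB = rnorm(16),
  ColC = rep(1:8, 2)
)
sdf <- createDataFrame(df)

result <- count_distinct_all(sdf)
print(result)

sparkR.session.stop()
```

One difference from the sapply version to keep in mind: Spark's distinct count skips NULL values, while length(unique(x)) in R counts NA as its own value, so the two can disagree on columns with missing data.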

Tags: r, apache-spark, sparkr