SparkR

Question

I would like to get some descriptive statistics about my data frame:

# Initialize SparkR Contexts
library(SparkR)                                     # Load library
sc <- sparkR.init(master="local[4]")                # Initialize Spark Context
sqlContext <- sparkRSQL.init(sc)                    # Initialize SQL Context

# Load data
df <- loadDF(sqlContext, "/outputs/merged.parquet") # Load data into Data Frame

# Filter 
df_t1 <- select(filter(df, df$t == 1 & df$totalUsers > 0 & isNotNull(df$domain)), "*")

avg_df <- collect(agg(groupBy(df_t1, "domain"), AVG=avg(df_t1$totalUsers), STD=sd(df_t1$totalUsers, na.rm = FALSE)))
head(avg_df)

I get this error:

Error in as.double(x) : 
  cannot coerce type 'S4' to vector of type 'double'

It is produced by sd(). I tried using var() instead and got Error: is.atomic(x) is not TRUE. I get no error when using only avg().

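My question is different from this one because I am not using these packages, and from reading I understand that, for some reason, my df_t1$totalUsers is of type S4 rather than a double vector, so I tried casting it, with no effect: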

avg_df <- collect(agg(groupBy(df_t1, "domain"),AVG=avg(df_t1$totalUsers), STD=sd(cast(df_t1$totalUsers, "double"),na.rm = FALSE)))

Any thoughts?

EDIT: The schema is

> printSchema(df_t1)
root
 |-- created: integer (nullable = true)
 |-- firstItem: integer (nullable = true)
 |-- domain: string (nullable = true)
 |-- t: integer (nullable = true)
 |-- groupId: string (nullable = true)
 |-- email: integer (nullable = true)
 |-- chat: integer (nullable = true)

My Spark version is 1.5.2

Answer 1

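You are using Spark 1.5, which does not provide the more advanced statistical summaries, and you cannot use standard R functions when operating on a Spark DataFrame. avg() works because it is actually a Spark SQL function that is available in Spark 1.5.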
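To illustrate why base R's sd() fails here, the following minimal sketch inspects a DataFrame column and then falls back to collecting the data locally. It assumes the Spark 1.5-style setup from the question; the sample data, the names col_is_s4, col_class, local_df and local_sd are made up for illustration, and the collect-first fallback is only reasonable when the data fits in the driver's memory.

```r
library(SparkR)

# Spark 1.5-style setup, as in the question (assumption: local mode)
sc <- sparkR.init(master = "local[4]")
sqlContext <- sparkRSQL.init(sc)

# Made-up sample data standing in for df_t1
df_t1 <- createDataFrame(sqlContext, data.frame(
  domain = c("a.com", "a.com", "a.com", "b.com", "b.com"),
  totalUsers = c(2, 4, 6, 10, 20),
  stringsAsFactors = FALSE
))

# A DataFrame column is an S4 Column expression, not a numeric vector,
# which is why as.double() inside base R's sd() cannot coerce it
col_is_s4 <- isS4(df_t1$totalUsers)
col_class <- as.character(class(df_t1$totalUsers))

# Fallback for small data only: bring the rows into local R first,
# then apply base R's sd() per group on an ordinary data.frame
local_df <- collect(select(df_t1, "domain", "totalUsers"))
local_sd <- aggregate(totalUsers ~ domain, data = local_df, FUN = sd)
```

Once the rows are in a local data.frame, any standard R function works, but all the computation then happens on the driver rather than in Spark.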

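Additional statistical summaries were introduced in Spark 1.6, including methods to compute the standard deviation (sd, stddev, stddev_samp and stddev_pop) and the variance (var, variance, var_samp, var_pop). You can of course still compute the standard deviation using the well-known formula.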
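One way to apply the well-known formula on Spark 1.5 is to aggregate E[X] and E[X^2] per group using only avg() and Column arithmetic, then finish the calculation in local R on the small collected result. This is a sketch under assumptions, not the linked answer's code: the sample data and the column names AVG_SQ, N, VAR_POP and STD are made up, and the sample standard deviation is obtained by applying Bessel's correction N / (N - 1) to the population variance.

```r
library(SparkR)

# Spark 1.5-style setup, as in the question (assumption: local mode)
sc <- sparkR.init(master = "local[4]")
sqlContext <- sparkRSQL.init(sc)

# Made-up sample data standing in for df_t1
df_t1 <- createDataFrame(sqlContext, data.frame(
  domain = c("a.com", "a.com", "a.com", "b.com", "b.com"),
  totalUsers = c(2, 4, 6, 10, 20),
  stringsAsFactors = FALSE
))

# Per-group moments computed in Spark: mean, mean of squares, and count
moments <- agg(
  groupBy(df_t1, "domain"),
  AVG = avg(df_t1$totalUsers),
  AVG_SQ = avg(df_t1$totalUsers * df_t1$totalUsers),
  N = n(df_t1$totalUsers)
)

# One row per domain, so collecting is cheap
result <- collect(moments)

# Population variance: E[X^2] - (E[X])^2; pmax() guards against tiny
# negative values caused by floating-point rounding
result$VAR_POP <- pmax(result$AVG_SQ - result$AVG^2, 0)

# Sample standard deviation (what base R's sd() reports); groups with a
# single row yield NaN because N - 1 is zero
result$STD <- sqrt(result$VAR_POP * result$N / (result$N - 1))
```

Note that this formula can lose precision when the mean is very large relative to the spread of the values; on Spark 1.6 or later the built-in aggregate functions are the safer choice.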

SparkR - Error in as.double(x) : cannot coerce type 'S4' to vector of type 'double'

r

dataframe

apache-spark

apache-spark-sql