R: Group in sparklyr ("sum", "count distinct", "mean")
I have the following data hosted in my working directory:
>library(sparklyr)
>library(dplyr)
>f <- data.frame(category = c("e","EE","W","S","Q","e","Q","S"),
                 DD = c(33.2, 33.2, 14.55, 12, 13.4, 45, 7, 3),
                 CC = c(2, 44, 4, 44, 9, 2, 2.2, 4),
                 FF = c("A","A","A","A","A","A","B","A"))
>write.csv(f, "D.csv")  ## write to the working directory
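One caveat worth noting (editor's sketch, using a temporary path rather than the working directory): write.csv() writes row names as an unnamed first column by default, and that column will be read back in as data. Passing row.names = FALSE keeps the CSV schema identical to the data frame.

```r
# Sketch: write.csv() adds a row-names column unless told otherwise.
path <- file.path(tempdir(), "D.csv")
f <- data.frame(category = c("e","EE","W","S","Q","e","Q","S"),
                DD = c(33.2, 33.2, 14.55, 12, 13.4, 45, 7, 3),
                CC = c(2, 44, 4, 44, 9, 2, 2.2, 4),
                FF = c("A","A","A","A","A","A","B","A"))
write.csv(f, path, row.names = FALSE)  # no extra row-names column
ncol(read.csv(path))  # 4 columns, matching f
```

Reading this file with spark_read_csv() then yields exactly the four intended columns.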
I read the file from the working directory with the following Spark commands:
>sc <- spark_connect(master = "local", spark_home = "/home/tomas/spark-2.1.0-bin-hadoop2.7/", version = "2.1.0")
>df <- spark_read_csv(sc, name = "data", path = "D.csv", header = TRUE, delimiter = ",")
I want to obtain a table like the one below, grouped by "category": the sum of DD, the mean of "CC", and the count of distinct values of "FF".
It should look like this:
category SumDD MeanCC CountDistinctFF
e 78.2 2 1
EE 33.2 44. 1
W 14.55 4 1
S 15 24 1
Q 20.4 5.6 2
I am not sure whether you are looking for a solution from a specific package, but this can be done with the dplyr package: we use group_by on the category column and summarise to compute the results we need.
Here is sample code.
Code:
f %>% group_by(category) %>%
summarise(sumDD = sum(DD), MeanCC = mean(CC), CountDistinctFF = length(unique(FF)))
Output:
category sumDD MeanCC CountDistinctFF
<fct> <dbl> <dbl> <int>
1 e 78.2 2 1
2 EE 33.2 44 1
3 Q 20.4 5.6 2
4 S 15 24 1
5 W 14.6 4 1
To operate on a Spark DataFrame you need to use dplyr functions. Naveen's answer works in the Spark environment except for the last variable: use n_distinct from dplyr instead of unique.
df0 <- df %>% group_by(category) %>%
  summarize(sumDD = sum(DD, na.rm = TRUE), MeanCC = mean(CC, na.rm = TRUE),
            CountDistinctFF = n_distinct(FF))
To inspect your result as a Spark DataFrame, you can use:
> glimpse(df0)
Observations: ??
Variables: 4
$ category <chr> "e", "EE", "S", "Q", "W"
$ sumDD <dbl> 78.20, 33.20, 15.00, 20.40, 14.55
$ MeanCC <dbl> 2.0, 44.0, 24.0, 5.6, 4.0
$ CountDistinctFF <dbl> 1, 1, 1, 2, 1
Or you can collect it back to the local system and manipulate it like any R data frame:
> df0%>%collect
# A tibble: 5 x 4
category sumDD MeanCC CountDistinctFF
<chr> <dbl> <dbl> <dbl>
1 e 78.2 2 1
2 EE 33.2 44 1
3 S 15 24 1
4 Q 20.4 5.6 2
5 W 14.6 4 1
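Once collected, the result is an ordinary local data frame, so every dplyr verb works on it directly. A small sketch, rebuilding the collected values shown above by hand and sorting by the summed column:

```r
library(dplyr)

# Local copy of the collected result (values from the output above)
df0_local <- data.frame(
  category = c("e", "EE", "S", "Q", "W"),
  sumDD = c(78.2, 33.2, 15, 20.4, 14.55),
  MeanCC = c(2, 44, 24, 5.6, 4),
  CountDistinctFF = c(1, 1, 1, 2, 1)
)

# Ordinary local operations now apply, e.g. sort by sumDD descending
df0_local %>% arrange(desc(sumDD))
```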
As a supplement to Antonis's reply: an error appeared later. Investigating, I found a conflict between packages, specifically dplyr and SparkR.
It was resolved by installing the tidyverse package and prefixing the conflicting verbs with dplyr::, as follows:
>library(tidyverse)
>df0=df%>%dplyr::group_by(category)%>%dplyr::summarize(sumDD=sum(DD,na.rm=T),MeanCC=mean(CC,na.rm=T),CountDistinctFF=n_distinct(FF))
>glimpse(df0)
Observations: ??
Variables: 4
$ category <chr> "e", "EE", "S", "Q", "W"
$ sumDD <dbl> 78.20, 33.20, 15.00, 20.40, 14.55
$ MeanCC <dbl> 2.0, 44.0, 24.0, 5.6, 4.0
$ CountDistinctFF <dbl> 1, 1, 1, 2, 1
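To see which names are actually being masked in a session (SparkR and dplyr export several verbs with the same names, such as summarize and group_by), base R's conflicts() lists them. A small diagnostic sketch:

```r
# List objects whose names clash across attached packages; with both
# SparkR and dplyr attached, the shared verbs show up here, which is
# why the dplyr:: prefix above is needed.
masked <- conflicts(detail = TRUE)
str(masked)  # a named list, one entry per attached environment
```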