
Count the number of elements in a group fulfilling a criterion

I want to group the rows of a sparklyr table by a specific column and count the rows that satisfy a certain condition.


> library(sparklyr)
> library(tidyverse)
> con = spark_connect(....)

> diamonds_sdl = copy_to(con, diamonds)
> diamonds_sdl
# Source:   table<diamonds> [?? x 10]
# Database: spark_connection
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
 7 0.240 Very Good I     VVS1     62.3  57.0   336  3.95  3.98  2.47
 8 0.260 Very Good H     SI1      61.9  55.0   337  4.07  4.11  2.53
 9 0.220 Fair      E     VS2      65.1  61.0   337  3.87  3.78  2.49
10 0.230 Very Good H     VS1      59.4  61.0   338  4.00  4.05  2.39

In plain R there are several ways I would do this, but none of them work in sparklyr. On the local data both of these work:


 > diamonds %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
 > diamonds %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))


# A tibble: 7 x 3
  color     n n_expensive
  <ord> <int>       <int>
1 D      6775        6756
2 E      9797        9758
3 F      9542        9517
4 G     11292       11257
5 H      8304        8274
6 I      5422        5379
7 J      2808        2748

But not in Spark:

diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=sum(price>400))
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'sum((CAST(diamonds.`price` AS BIGINT) > 400L))' due to data type mismatch: function sum requires numeric types, not BooleanType; line 1 pos 33;

diamonds_sdl %>% group_by(color) %>% summarise(n=n(), n_expensive=length(price[price>400]))
Error in eval_bare(call, env) : object 'price' not found

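This is probably a type conflict: Spark SQL's sum requires a numeric type and does not accept a boolean. Converting the logical to an integer solves the problem: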

con <- spark_connect(master = "local")
diamonds1 = copy_to(con, diamonds)
diamonds1 %>% 
         group_by(color) %>%
         summarise(n=n(), n_expensive = sum(as.integer(price > 400)))
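
To see why the cast matters, here is a small base-R sketch. The price vector is made up for illustration and is not the diamonds data. In R, sum() silently coerces TRUE/FALSE to 1/0, which is why the uncast version works locally; Spark SQL performs no such coercion, so the conversion has to be written out.

```r
# Toy prices, invented for this illustration only
price <- c(326L, 450L, 399L, 1200L)

is_expensive <- price > 400                    # logical vector

n_implicit <- sum(is_expensive)                # R coerces TRUE/FALSE to 1/0
n_explicit <- sum(as.integer(is_expensive))    # explicit cast, as Spark needs
n_ifelse   <- sum(ifelse(is_expensive, 1, 0))  # same count via a conditional

# All three approaches agree on local data
stopifnot(n_implicit == n_explicit, n_explicit == n_ifelse)
```

All three forms give the same count in local R; only the explicit ones translate to something Spark's sum accepts.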


You have to think in terms of SQL expressions, for example if_else:

diamonds_sdl %>% group_by(color) %>% 
  summarise(n=n(), n_expensive=sum(if_else(price > 400, 1, 0)))

Or sum a cast:

diamonds_sdl %>% group_by(color) %>%
   summarise(n=n(), n_expensive=sum(as.numeric(price > 400)))