Sparklyr: Use group_by and then concatenate strings from rows in a group

Question

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows within a group.

Here is a simple example that I think should work but doesn't:

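library(sparklyr)
library(dplyr)
# assumes a local Spark installation; sc is the connection used below
sc <- spark_connect(master = "local")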
d <- data.frame(id=c("1", "1", "2", "2", "1", "2"), 
             x=c("200", "200", "200", "201", "201", "201"), 
             y=c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")
d_sdf %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))

What I'd like it to produce is:

Source: local data frame [6 x 3]
Groups: id, x [4]

# A tibble: 6 x 3
      id      x         y
  <fctr> <fctr>     <chr>
1      1    200 This That
2      1    200 This That
3      2    200       The
4      2    201 Other End
5      1    201       End
6      2    201 Other End

Instead, I get the following error:

Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42

Note that the same code works fine on a data.frame:

d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))

Answer 1

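Spark SQL doesn't like aggregate functions used without aggregating, which is why this works in dplyr on an ordinary data frame but not on a Spark DataFrame: sparklyr translates your commands into a SQL statement. If you look at the second part of the error message, you can see where the translation goes wrong: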

== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`

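paste gets translated to CONCAT_WS, and the collapse argument is passed along as an extra aliased value, which is not valid SQL. Even without that, CONCAT_WS joins columns within a row; it does not combine values across the rows of a group.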
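The difference between the two behaviours can be seen in plain R, with no Spark involved. This is only a small sketch using two made-up vectors (x and y here are illustrative, taken from the first group of the example data):

```r
# Two values from the same group in the example data
x <- c("200", "200")
y <- c("This", "That")

# Without collapse, paste() works row by row, like CONCAT_WS on two columns
row_wise <- paste(x, y)

# With collapse, paste() folds all rows into one string, which is an aggregation
across_rows <- paste(y, collapse = " ")

print(row_wise)
print(across_rows)
```

Only the first form has a direct row-level counterpart in SQL; the second needs an aggregate.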

Better equivalents are collect_list and collect_set, but they produce list (array) output.

But you can build on that:

If you don't want the same row replicated in your result, you can use summarise, collect_list, and paste:

res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_list(y)))

Result:

Source:     lazy query [?? x 3]
Database:   spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id

     id     x   yconcat
  <chr> <chr>     <chr>
1     1   201       End
2     2   201 Other End
3     1   200 This That
4     2   200       The

If you do want your rows replicated, you can join the result back to your original data:

d_sdf %>% left_join(res)

Result:

Source:     lazy query [?? x 4]
Database:   spark connection master=local[8] app=sparklyr local=TRUE

     id     x     y   yconcat
  <chr> <chr> <chr>     <chr>
1     1   200  This This That
2     1   200  That This That
3     2   200   The       The
4     2   201 Other Other End
5     1   201   End       End
6     2   201   End Other End
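
For reference, the two steps can be put together as one script. This is a minimal sketch, not the only way to do it. It assumes a local Spark installation and relies on sparklyr passing functions it does not recognise (here concat_ws and collect_list, both Spark SQL functions) through to Spark unchanged. The name joined and the explicit by argument are additions for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install())
sc <- spark_connect(master = "local")

d <- data.frame(
  id = c("1", "1", "2", "2", "1", "2"),
  x  = c("200", "200", "200", "201", "201", "201"),
  y  = c("This", "That", "The", "Other", "End", "End"),
  stringsAsFactors = FALSE
)
d_sdf <- copy_to(sc, d, "d", overwrite = TRUE)

# One row per (id, x): gather y into an array, then join it with spaces.
# concat_ws() and collect_list() are evaluated by Spark SQL, not by R.
res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = concat_ws(" ", collect_list(y))) %>%
  ungroup()

# Attach the concatenated string to every original row
joined <- d_sdf %>%
  left_join(res, by = c("id", "x")) %>%
  collect()

print(joined)
spark_disconnect(sc)
```

Spark does not guarantee the order of elements returned by collect_list, so the words inside yconcat may not always appear in the original row order.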
