Translate Spark SQL function to "normal" R code

Question

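I am trying to follow the vignette "How to make a Markov Chain" (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).

The tutorial is interesting because it uses the same data source that I use. However, part of the code uses "Spark SQL code" (as I learned from my previous question).

My problem: I have googled a lot and tried to solve this myself, but I don't know how, because I don't know exactly what the data should look like (the author gives no example of his DF before and after the function).

How can I convert this piece of code into "normal" R code (without Spark)? In particular, the concat_ws and collect_list functions are causing trouble.

He is using this piece of code: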

channel_stacks = data_feed_tbl %>%
 group_by(visitor_id, order_seq) %>%
 summarize(
   path = concat_ws(" > ", collect_list(mid_campaign)),
   conversion = sum(conversion)
 ) %>% ungroup() %>%
 group_by(path) %>%
 summarize(
   conversion = sum(conversion)
 ) %>%
 filter(path != "") %>%
 collect()

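From my previous question, I know that one part of the code can be replaced:

concat_ws() can be replaced by the paste() function

But then another part of the code comes into play:

collect_list()  # description: Aggregate function: returns a list of objects with duplicates.

I hope I have described the problem as clearly as possible.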

Answer 1

paste can collapse a character vector into a single string, using the separator supplied in its collapse argument.
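A minimal sketch of that behaviour in isolation, in case it helps to see it outside a pipeline. The channel names are made up for illustration and do not come from the tutorial's data.

```r
# Hypothetical channel names, for illustration only.
channels <- c("Email", "Paid Search", "Display")

# Without collapse, paste() keeps one string per element.
separate <- paste(channels)

# With collapse, the elements are joined into a single string.
path <- paste(channels, collapse = " > ")
print(path)
# [1] "Email > Paid Search > Display"
```

The collapsed result has length 1, which is what summarize() needs to produce one row per group.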

Inside a grouped summarize(), paste() with collapse can therefore be used as a replacement for concat_ws(" > ", collect_list(mid_campaign)):

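library(dplyr)

# Assumes data_feed_tbl is a local data frame (not a Spark table).
channel_stacks = data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    # One path string per visitor/order: all touchpoints joined with " > "
    path = paste(mid_campaign, collapse = " > "),
    conversion = sum(conversion)
  ) %>% ungroup() %>%
  group_by(path) %>%
  # Total conversions per distinct path
  summarize(
    conversion = sum(conversion)
  ) %>%
  filter(path != "")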
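
Since the tutorial does not show the data before and after this step, here is a small sketch on made-up data. The column names are taken from the code above, but the values and the data frame itself are assumptions for illustration, not the author's data. It also assumes the rows are already in touchpoint order, because paste() joins values in row order within each group.

```r
library(dplyr)

# Hypothetical stand-in for data_feed_tbl: one row per touchpoint,
# already sorted in the order the touches happened.
data_feed_tbl <- data.frame(
  visitor_id   = c("a", "a", "a", "b", "b", "c"),
  order_seq    = c(1, 1, 1, 1, 1, 1),
  mid_campaign = c("Email", "Search", "Display", "Email", "Search", "Display"),
  conversion   = c(0, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

channel_stacks <- data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
  summarize(
    path = paste(mid_campaign, collapse = " > "),
    conversion = sum(conversion),
    .groups = "drop"  # same effect as the ungroup() call above
  ) %>%
  group_by(path) %>%
  summarize(conversion = sum(conversion), .groups = "drop") %>%
  filter(path != "")

print(channel_stacks)
```

With this input the result has three rows, one per distinct path: "Email > Search > Display" with 1 conversion, "Email > Search" with 1 conversion, and "Display" with 0 conversions. If the real data is not sorted by time, an arrange() on a timestamp column before the first group_by() would be needed to get meaningful paths.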

r

dplyr

sparklyr