Using purrr to iterate over two lists and then pipe into dplyr::filter

library(tidyverse)
library(purrr)

Using the sample data below, I can create the following function:

Funs <- function(DF, One, Two){

    One <- enquo(One)
    Two <- enquo(Two)

    DF %>% filter(School == (!!One) & Code == (!!Two)) %>%
        group_by(Code, School) %>%
        summarise(Count = sum(Question1))
}

I can then use the function to filter on two variables, School and Code, like this:

Funs(DF, "School1", "B344")

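This all works fine, but my actual data has many variables, so rather than repeatedly typing the "School" and "Code" values into the function, I'd like to use the tidyverse and purrr packages to loop over two lists (one of schools, one of codes) and feed them to the filter. I'd like the output to be a list of results.

To keep things simple, the two lists to be fed into dplyr::filter have only two values each: School2 will be paired with S300 and School1 with B344, as in the example above.

Some of the things I've tried: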

map2(c(“School2”, ”School1”),
     c(“S300”, ”B344”),
     function(x,y) {
         DF %>% filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1))
     }

And also...

map2(c("School2", "School1")),
     c("S300","B344"),
     ~filter(School == .x & Code == .y) %>%
         group_by(Code, School)%>%
         summarise(Count = sum(Question1))

And this one...

list(c("School2", "School1"), c("S300", "B344")) %>%
    map2( ~ filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1)))

None of these seem to work, so any help would be much appreciated!

Sample data:

Code <- c("B344","B555","S300","T220","B888","B888","B555","B344","B344","T220","B555","B555","S300","B555","S300","S300","S300","S300","B344","B344","B888","B888","B888")
School <- c("School1","School1","School2","School3","School4","School4","School1","School1","School3","School3","School4","School1","School1","School3","School2","School2","School4","School2","School3","School4","School3","School1","School2")
Question1 <- c(3,4,5,4,5,5,5,4,5,3,4,5,4,5,4,3,3,3,4,5,4,3,3)
Question2 <- c(5,4,3,4,3,5,4,3,2,3,4,5,4,5,4,3,4,4,5,4,3,3,4)
DF <- data_frame(Code, School, Question1, Question2)

Here are a few options, ordered from closest to your code to most optimized:

library(tidyverse)

DF <- data_frame(Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344", "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300", "S300", "S300", "B344", "B344", "B888", "B888", "B888"), 
                 School = c("School1", "School1", "School2", "School3", "School4", "School4", "School1", "School1", "School3", "School3", "School4", "School1", "School1", "School3", "School2", "School2", "School4", "School2", "School3", "School4", "School3", "School1", "School2"), 
                 Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3), 
                 Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4))

wanted <- data_frame(School = c("School2", "School1"),
                     Code = c("S300", "B344"))

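To make map2 work: if you use the tilde (formula) notation, the variables are named .x and .y; if you use regular function notation, you can call them whatever you like. Don't forget the first argument of filter, the data frame, which normally arrives through the pipe and has to be supplied explicitly here, so: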

map2_dfr(wanted$School, wanted$Code, ~filter(DF, School == .x, Code == .y)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
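
If you want a list with one summary per School/Code pair, as the question asks, rather than a single combined data frame, a minimal sketch with plain map2 and regular function notation could look like the following. The argument names school and code, the results object, and the naming via set_names are illustrative choices, not anything from the question, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)

# Sample data from the question
DF <- tibble(
  Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344",
           "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300",
           "S300", "S300", "B344", "B344", "B888", "B888", "B888"),
  School = c("School1", "School1", "School2", "School3", "School4", "School4",
             "School1", "School1", "School3", "School3", "School4", "School1",
             "School1", "School3", "School2", "School2", "School4", "School2",
             "School3", "School4", "School3", "School1", "School2"),
  Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3),
  Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4)
)

schools <- c("School2", "School1")
codes   <- c("S300", "B344")

# Regular function notation: the argument names are free to choose.
# They are lower-case here so they cannot be confused with the
# School and Code columns inside filter().
results <- map2(schools, codes, function(school, code) {
  DF %>%
    filter(School == school, Code == code) %>%
    group_by(Code, School) %>%
    summarise(Count = sum(Question1), .groups = "drop")
})

# Name each list element after its School/Code pair (illustrative)
results <- set_names(results, paste(schools, codes, sep = "_"))
```

Each element of results is then a one-row tibble, and results[["School1_B344"]] matches what Funs(DF, "School1", "B344") computes in the question.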

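Since I set up wanted as a data frame (the original lists would work too), you can use pmap instead. With two variables, the parameter names with pmap can actually be the same as with map2, but it is really a function with ... parameters, so it often makes sense to handle them differently, e.g. with the ..1 notation: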

wanted %>% 
    pmap_dfr(~filter(DF, School == ..1, Code == ..2)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
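
Because pmap takes a list of argument vectors, the same call also works when the pairs are held in a plain list rather than a data frame. The sketch below assumes an unnamed list (the names wanted_list, filtered and summary_tbl are made up for illustration) and uses pmap followed by bind_rows in place of pmap_dfr, which should stack the rows the same way.

```r
library(dplyr)
library(purrr)

# Sample data from the question
DF <- tibble(
  Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344",
           "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300",
           "S300", "S300", "B344", "B344", "B888", "B888", "B888"),
  School = c("School1", "School1", "School2", "School3", "School4", "School4",
             "School1", "School1", "School3", "School3", "School4", "School1",
             "School1", "School3", "School2", "School2", "School4", "School2",
             "School3", "School4", "School3", "School1", "School2"),
  Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3),
  Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4)
)

# Hypothetical plain-list version of `wanted`: schools first, codes second
wanted_list <- list(c("School2", "School1"), c("S300", "B344"))

# ..1 and ..2 refer to the first and second list element by position
filtered <- wanted_list %>%
  pmap(~ filter(DF, School == ..1, Code == ..2)) %>%
  bind_rows()

summary_tbl <- filtered %>%
  group_by(School, Code) %>%
  summarise(Question1 = sum(Question1),
            Question2 = sum(Question2),
            .groups = "drop")
```

Positional matching is the reason ..1 and ..2 are convenient here: the list has no names, so the order of its elements is the only thing tying each vector to its column.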

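The issue with the two techniques above is that, at scale, they will be slow, because they run filter once for each row of wanted, meaning every row of DF gets re-tested multiple times. To keep the code similar, a somewhat hacky way to avoid the extra work is to combine the columns into one, e.g. with tidyr::unite: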

DF %>% 
    unite(school_code, School, Code) %>% 
    filter(school_code %in% paste(wanted$School, wanted$Code, sep = '_')) %>%    # or do.call(paste, c(wanted, sep = '_')); purrr::invoke is deprecated as of purrr 1.0.0
    separate(school_code, c('School', 'Code')) %>%
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

...or by combining them within filter itself:

DF %>% 
    filter(paste(School, Code) %in% paste(wanted$School, wanted$Code)) %>%    # or do.call(paste, wanted)
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

The best way to get the result you want may be more obvious now that I've set up wanted as a data frame: a join, which is designed to do exactly this job:

DF %>% 
    inner_join(wanted) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> Joining, by = c("Code", "School")
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
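
A closely related variant, offered as a sketch and not as a replacement for the inner_join above: since wanted only selects rows and contributes no extra columns, a filtering join such as semi_join expresses the same intent. Spelling out by avoids the "Joining, by" message. The across() call and the .groups argument assume dplyr 1.0.0 or later, and the name result is illustrative.

```r
library(dplyr)

# Sample data from the question
DF <- tibble(
  Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344",
           "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300",
           "S300", "S300", "B344", "B344", "B888", "B888", "B888"),
  School = c("School1", "School1", "School2", "School3", "School4", "School4",
             "School1", "School1", "School3", "School3", "School4", "School1",
             "School1", "School3", "School2", "School2", "School4", "School2",
             "School3", "School4", "School3", "School1", "School2"),
  Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3),
  Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4)
)

wanted <- tibble(School = c("School2", "School1"),
                 Code   = c("S300", "B344"))

# semi_join keeps the rows of DF that have a match in wanted,
# without adding any columns from wanted
result <- DF %>%
  semi_join(wanted, by = c("School", "Code")) %>%
  group_by(School, Code) %>%
  summarise(across(everything(), sum), .groups = "drop")
```

If a list of results is still needed at the end, splitting result by group (for example with split(result, result$School)) is one way to get there without looping over the filter.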