Collapse rows across group and remove duplicates and NAs

Question

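I am trying to collapse values across rows within groups and remove duplicates and NAs. I've tried several {tidyverse} approaches, including purrr::nest, dplyr::summarize(x = paste(x, collapse = ", ")) and dplyr::summarize(x = list(x)), but had no luck. I would appreciate your help! Below is a reprex of the input and the desired output.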

# Collapse rows across group and remove duplicates and NAs

library(dplyr)

df_in <- tribble(
  ~group, ~subgroup, ~color, ~shape, ~emotion, ~shade,
  1,      "a",       "red",   NA,   "happy",   NA,
  1,      "a",       "red",   NA,   "sad",   "striped"
)

df_in
#> # A tibble: 2 × 6
#>   group subgroup color shape emotion shade  
#>   <dbl> <chr>    <chr> <lgl> <chr>   <chr>  
#> 1     1 a        red   NA    happy   <NA>   
#> 2     1 a        red   NA    sad     striped


df_out <- tribble(
  ~group, ~subgroup, ~color, ~shape, ~emotion,    ~shade,
  1,      "a",       "red",   NA,   "happy, sad", "striped"
)

df_out
#> # A tibble: 1 × 6
#>   group subgroup color shape emotion    shade  
#>   <dbl> <chr>    <chr> <lgl> <chr>      <chr>  
#> 1     1 a        red   NA    happy, sad striped

Created on 2021-11-19 by the reprex package (v2.0.0)

Answer 1

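We can use group_by and summarise(across(everything(), ...)) to apply a function to every non-grouping column. In our case the function is written as a formula (the ~ notation), inside which the current column is referred to as .x.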

As you suggested, we can paste the rows together (using collapse = ", "). I removed the NA values with .x[!is.na(.x)] and the duplicates with unique().

df_in %>% 
  group_by(group, subgroup) %>% 
  summarise(across(everything(), ~ paste(unique(.x[!is.na(.x)]), collapse = ", "))) %>% 
  ungroup()

The only difference from the expected output is that the shape column is now an empty string instead of an NA value:

# A tibble: 1 x 6
  group subgroup color shape emotion    shade  
  <dbl> <chr>    <chr> <chr> <chr>      <chr>  
1     1 a        red   ""    happy, sad striped
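
The empty string appears because every value in shape is NA, so nothing is left after filtering, and paste() with collapse returns "" for a zero-length input. A minimal base R sketch of that behaviour, using a made-up vector x rather than the actual column:

```r
# Hypothetical stand-in for a column whose values are all NA
x <- c(NA, NA)

# Drop the NAs and de-duplicate, as the formula in the answer does
kept <- unique(x[!is.na(x)])

length(kept)                  # no values survive the filter
paste(kept, collapse = ", ")  # collapsing a zero-length vector yields ""
```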

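This can be fixed with a small helper function that returns NA when no non-missing values are left, instead of pasting a zero-length vector. The NA has to be returned directly, because passing it through paste() would turn it into the string "NA".

paste_rows <- function(x) {
  # Drop NAs, then keep each remaining value once
  unique_x <- unique(x[!is.na(x)])
  # Nothing left: return a real missing value
  # (paste(NA, collapse = ", ") would give the string "NA")
  if (length(unique_x) == 0) {
    return(NA_character_)
  }

  paste(unique_x, collapse = ", ")
}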

df_in %>% 
  group_by(group, subgroup) %>% 
  summarise(across(everything(), paste_rows)) %>% 
  ungroup()
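
As an alternative that is not part of the answer above, one could keep the one-line formula and convert the empty strings to NA afterwards. The sketch below assumes dplyr::na_if() and where() are available in the installed dplyr version; df_alt is a name chosen here for illustration only.

```r
library(dplyr)

# Same input as in the question
df_in <- tribble(
  ~group, ~subgroup, ~color, ~shape, ~emotion, ~shade,
  1,      "a",       "red",   NA,   "happy",   NA,
  1,      "a",       "red",   NA,   "sad",   "striped"
)

df_alt <- df_in %>%
  group_by(group, subgroup) %>%
  # Collapse the unique non-NA values of each non-grouping column
  summarise(
    across(everything(), ~ paste(unique(.x[!is.na(.x)]), collapse = ", ")),
    .groups = "drop"
  ) %>%
  # Turn the "" left behind by all-NA columns back into NA
  mutate(across(where(is.character), ~ na_if(.x, "")))

df_alt
```

With this approach shape ends up as a character column holding NA, not the logical NA shown in the desired output, since every collapsed column has already been converted to character by paste().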