Group by, summarize, spread in R not working

I have a data frame that looks like this:

  ID  Code  Desc
  1   0A    Red
  1   NA    Red
  2   1A    Blue
  3   2B    Green

First, I want to create a new column that concatenates the values of the Code column for rows that share the same ID. So:

  ID  Combined_Code  Desc
  1    0A | NA       Red
  2    1A            Blue
  3    2B            Green

Then I want to take the original Code column and spread it. The values would be the number of times each code appears for a given ID. So:

  ID  Combined_Code 0A  NA  1A  2B  Desc
  1    0A | NA      1   1   0   0   Red
  2    1A           0   0   1   0   Blue
  3    2B           0   0   0   1   Green

I tried:

sample_data %>%
 group_by(ID) %>%
 summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|'))

This works for creating the concatenation. However, I can't get it to work together with spread:

 sample_data %>%
  group_by(ID) %>%
  summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|'))

sample_data <- spread(count(sample_data, ID, Combined_Code, Desc., Code), Code, n, fill = 0)

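This does the spread, but it drops the concatenation. I also tried this with filter instead of summarise, and it gave the same result. It produces: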

 ID  Combined_Code 0A  NA  1A  2B  Desc
  1    0A          1   0   0   0   Red
  1    NA          0   1   0   0   Red
  2    1A          0   0   1   0   Blue
  3    2B          0   0   0   1   Green

Finally, I tried piping the result of summarise into spread:

sample_data %>%
  group_by(ID) %>%
  summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|')) %>%
  spread(count(sample_data, ID, Combined_Code, Desc., Code), Code, n, fill = 0)

This results in an error:

Error: `var` must evaluate to a single number or a column name, not a list
Run `rlang::last_error()` to see where the error occurred.

What can I do to solve these problems?

We can do a grouped paste:

library(dplyr)
library(stringr)
df1 %>%
   group_by(ID, Desc) %>%
   summarise(Combined_Code = str_c(Code, collapse="|"))
# A tibble: 3 x 3
# Groups:   ID [3]
#     ID Desc  Combined_Code
#  <int> <chr> <chr>        
#1     1 Red   0A|0B        
#2     2 Blue  1A           
#3     3 Green 2B     

For the second case, create a column 'val' of 1s, paste the 'Code' elements after grouping by 'ID' and 'Desc', and then use pivot_wider from tidyr to reshape from 'long' to 'wide' format.

library(tidyr)
df1 %>% 
   mutate(val = 1) %>%
   group_by(ID, Desc) %>% 
   mutate(Combined_Code = str_c(Code, collapse="|")) %>% 
   pivot_wider(names_from = Code, values_from = val, values_fill = list(val = 0))
# A tibble: 3 x 7
# Groups:   ID, Desc [3]
#    ID Desc  Combined_Code  `0A`  `0B`  `1A`  `2B`
#  <int> <chr> <chr>         <dbl> <dbl> <dbl> <dbl>
#1     1 Red   0A|0B             1     1     0     0
#2     2 Blue  1A                0     0     1     0
#3     3 Green 2B                0     0     0     1
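
The question asks for the number of times each code appears per ID. With a constant 'val' of 1, a code that is repeated within an ID would not be summed. A minimal sketch of one way to handle that, assuming a hypothetical data frame df_dup (not from the question) in which "0A" occurs twice for ID 1, is to count before reshaping:

```r
library(dplyr)
library(tidyr)

# Hypothetical data (an assumption for illustration): like df1,
# but code "0A" appears twice for ID 1
df_dup <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 3L),
  Code = c("0A", "0A", "0B", "1A", "2B"),
  Desc = c("Red", "Red", "Red", "Blue", "Green"),
  stringsAsFactors = FALSE
)

wide_counts <- df_dup %>%
  group_by(ID, Desc) %>%
  # unique() keeps each code once in the concatenated label
  mutate(Combined_Code = paste(unique(Code), collapse = "|")) %>%
  ungroup() %>%
  # one row per ID/Code combination, with n = number of occurrences
  count(ID, Desc, Combined_Code, Code) %>%
  # codes that never occur for an ID are filled with 0
  pivot_wider(names_from = Code, values_from = n, values_fill = list(n = 0L))

wide_counts
```

The same count() result could also be passed to the older spread(Code, n, fill = 0), which is closer to the attempt in the question.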

The OP's expected output is

  ID  Combined_Code 0A  0B  1A  2B  Desc
  1    0A | 0B      1   1   0   0   Red
  2    1A           0   0   1   0   Blue
  3    2B           0   0   0   1   Green
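
The expected output has 'Desc' as the last column and spaces around the separator. If that layout matters, a sketch along these lines may help; it assumes dplyr 1.0.0 or later for relocate(), and the name 'result' is only illustrative:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  ID   = c(1L, 1L, 2L, 3L),
  Code = c("0A", "0B", "1A", "2B"),
  Desc = c("Red", "Red", "Blue", "Green"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  mutate(val = 1) %>%
  group_by(ID, Desc) %>%
  # " | " as separator to mirror the expected output
  mutate(Combined_Code = paste(Code, collapse = " | ")) %>%
  ungroup() %>%
  pivot_wider(names_from = Code, values_from = val,
              values_fill = list(val = 0)) %>%
  # move Desc behind the spread code columns
  relocate(Desc, .after = last_col())

names(result)
```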

Update

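In the updated dataset, 'Code' contains NA elements. By default, str_c returns NA if any of the elements is NA, whereas paste converts NA to the string "NA" and keeps it alongside the other elements. Here, we replace str_c with paste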

df2 %>% 
    mutate(val = 1) %>%
    group_by(ID, Desc) %>% 
    mutate(Combined_Code = paste(Code, collapse="|")) %>% 
    pivot_wider(names_from = Code, values_from = val, values_fill = list(val = 0))
# A tibble: 3 x 7
# Groups:   ID, Desc [3]
#     ID Desc  Combined_Code  `0A`  `NA`  `1A`  `2B`
#  <int> <chr> <chr>         <dbl> <dbl> <dbl> <dbl>
#1     1 Red   0A|NA             1     1     0     0
#2     2 Blue  1A                0     0     1     0
#3     3 Green 2B                0     0     0     1
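
The difference in NA handling between the two functions can be checked on a small vector. This is only a quick sketch; 'codes' is an illustrative name, and str_replace_na is shown as one possible way to keep using str_c:

```r
library(stringr)

codes <- c("0A", NA)

# str_c: a missing value makes the whole result missing
res_str_c <- str_c(codes, collapse = "|")

# paste: NA is converted to the string "NA" before joining
res_paste <- paste(codes, collapse = "|")

# str_replace_na turns NA into "NA" first, so str_c then behaves like paste
res_replaced <- str_c(str_replace_na(codes), collapse = "|")

is.na(res_str_c)   # TRUE
res_paste          # "0A|NA"
res_replaced       # "0A|NA"
```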

Data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L), Code = c("0A", "0B", "1A", 
"2B"), Desc = c("Red", "Red", "Blue", "Green")), 
class = "data.frame", row.names = c(NA, 
-4L))



df2 <- structure(list(ID = c(1L, 1L, 2L, 3L), Code = c("0A", NA, "1A", 
"2B"), Desc = c("Red", "Red", "Blue", "Green")), class = "data.frame",
row.names = c(NA, 
-4L))