Create all possible combinations of non-NA values for each group ID

Question

Similar to a related question, but with an added twist:

Given the following data frame:

txt <- "ID    Col1    Col2    Col3    Col4
        1     6       10      NA      NA
        1     5       10      NA      NA
        1     NA      10      15      20
        2     17      25      NA      NA
        2     13      25      NA      NA
        2     NA      25      21      34
        2     NA      25      35      40"
DF <- read.table(text = txt, header = TRUE)

DF
  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40

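I would like to collapse the rows by group ID (analogous to Col2 in this example) and, when more than one combination exists per group, return all combinations, like so: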

  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   13   25   21   34
5  2   17   25   35   40
6  2   13   25   35   40

Importantly, I will later need this to work on non-numeric data. Any suggestions? Thanks!

Answer 1

Group by 'ID', fill the other columns, ungroup to remove the grouping attribute, and keep the distinct rows:

library(dplyr)
library(tidyr)
DF %>% 
    group_by(ID) %>% 
    fill(everything(), .direction = 'updown') %>%
    ungroup %>% 
    distinct(.keep_all = TRUE)

Or possibly:

DF %>% 
   group_by(ID) %>% 
   mutate(across(everything(), ~ replace(., is.na(.), 
           rep(.[!is.na(.)], length.out = sum(is.na(.))))))

Or, based on the comments:

DF %>%
   group_by(ID) %>%
   mutate(across(where(~ any(is.na(.))), ~ {
        i1 <- is.na(.)
        ind <- which(i1)
        i2 <- !i1
        if(i1[1] == 1) rep(.[i2], each = n()/sum(i2)) else 
               rep(.[i2], length.out = n())
     })) %>%
   ungroup %>% 
   distinct(.keep_all = TRUE)

Output:

# A tibble: 6 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     6    10    15    20
2     1     5    10    15    20
3     2    17    25    21    34
4     2    13    25    21    34
5     2    17    25    35    40
6     2    13    25    35    40

Answer 2

A data.table option that uses na.locf from zoo to fill the missing values:

library(zoo)
library(data.table)

setDT(DF)
cols <- grep('Col', names(DF), value = TRUE)
DF[, (cols) := lapply(.SD, function(x) fcoalesce(na.locf(x, na.rm = FALSE), 
                      na.locf(x, na.rm = FALSE, fromLast = TRUE))), ID]
unique(DF)

#   ID Col1 Col2 Col3 Col4
#1:  1    6   10   15   20
#2:  1    5   10   15   20
#3:  2   17   25   21   34
#4:  2   13   25   21   34
#5:  2   13   25   35   40
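
Note that the output above has five rows, while the desired result in the question has six: the combination 17, 25, 35, 40 does not appear. As a rough sanity check, and assuming the goal is every pairing of the Col1 values with the (Col3, Col4) pairs within an ID, the expected row count per group can be computed in base R. This is only a sketch; the name `expected` is introduced here for illustration.

```r
# Recreate the example data from the question
DF <- read.table(header = TRUE, text = "
ID Col1 Col2 Col3 Col4
1  6  10 NA NA
1  5  10 NA NA
1  NA 10 15 20
2  17 25 NA NA
2  13 25 NA NA
2  NA 25 21 34
2  NA 25 35 40")

# Per ID: (number of distinct non-NA Col1 values) times
#         (number of distinct NA-free (Col3, Col4) pairs)
expected <- sapply(split(DF, DF$ID), function(g) {
  n_col1  <- length(unique(na.omit(g$Col1)))
  n_pairs <- nrow(unique(na.omit(g[c("Col3", "Col4")])))
  n_col1 * n_pairs
})

expected       # one count per ID
sum(expected)  # total number of rows a complete result should have
```

Comparing `sum(expected)` with the row count of a candidate result shows whether any combination is missing.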

Answer 3

In a comment, the OP pointed out:

for my purposes I don't care about the arrangement of elements so long as all combinations of Col1 and (Col3&Col4) per group ID exist in the output

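So, if I understand correctly, the question is not about collapsing but about creating all possible combinations of the non-NA values of columns Col1, Col2, and the combined columns (Col3, Col4) for each ID group.

For this, expand() and nesting() from the tidyr package can be used to create the combinations. A subsequent call to na.omit() removes all rows that contain any NA: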

library(dplyr)
library(tidyr)
DF %>% 
  group_by(ID) %>% 
  expand(Col1, Col2, nesting(Col3, Col4)) %>% 
  na.omit() %>% 
  ungroup()

     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     1     6    10    15    20
3     2    13    25    21    34
4     2    13    25    35    40
5     2    17    25    21    34
6     2    17    25    35    40

This approach also works for non-numeric data.

Edit 1
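
To illustrate the non-numeric case, here is a minimal sketch that applies the same pipeline to a character-valued data frame. `DF_chr` and its values are made up for illustration and are not part of the original question; only the NA layout mirrors DF.

```r
library(dplyr)
library(tidyr)

# Hypothetical character-valued variant of DF (same NA layout, invented values)
DF_chr <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2),
  Col1 = c("a", "b", NA, "c", "d", NA, NA),
  Col2 = c("x", "x", "x", "y", "y", "y", "y"),
  Col3 = c(NA, NA, "p", NA, NA, "q", "r"),
  Col4 = c(NA, NA, "P", NA, NA, "Q", "R")
)

res_chr <- DF_chr %>%
  group_by(ID) %>%
  # cross Col1 and Col2 with the (Col3, Col4) pairs that occur together
  expand(Col1, Col2, nesting(Col3, Col4)) %>%
  # drop every combination that still contains an NA
  na.omit() %>%
  ungroup()

res_chr
```

nesting() keeps Col3 and Col4 paired as they occur in the data, so "q" is only ever combined with "Q" and "r" with "R".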

Giving it a second thought, I was wondering about the peculiar structure of the input dataset, namely the positions of the NAs:

DF

  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40

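To me, it appears as if DF was constructed from three separate subsets, the first one for Col1

  ID Col1
1  1    6
2  1    5
4  2   17
5  2   13

the second one for Col2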

  ID Col2
1  1   10
4  2   25

the third one for Col3 and Col4

  ID Col3 Col4
3  1   15   20
6  2   21   34
7  2   35   40

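Based on this observation, here is a different approach that creates all possible combinations through a series of merge operations (Cartesian joins) on the subsets:

library(magrittr) # piping used here to improve readability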
list("Col1", "Col2", c("Col3", "Col4")) %>% 
  lapply(function(x) DF[c("ID", x)] %>% na.omit %>% unique) %>% 
  Reduce(merge, .)

  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   17   25   35   40
5  2   13   25   21   34
6  2   13   25   35   40

Here, lapply() creates a list of subsets of the input dataset, which is then merged repeatedly using Reduce().
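
The repeated merge can be unrolled to make each step visible. The following base R sketch builds the three subsets by hand and shows that Reduce(merge, ...) amounts to nested merge() calls; the names `s1`, `s2`, `s3`, `step`, and `whole` are introduced here for illustration only.

```r
# Recreate the example data from the question
DF <- read.table(header = TRUE, text = "
ID Col1 Col2 Col3 Col4
1  6  10 NA NA
1  5  10 NA NA
1  NA 10 15 20
2  17 25 NA NA
2  13 25 NA NA
2  NA 25 21 34
2  NA 25 35 40")

# Step 1: one de-duplicated, NA-free subset per column group
s1 <- unique(na.omit(DF[c("ID", "Col1")]))
s2 <- unique(na.omit(DF[c("ID", "Col2")]))
s3 <- unique(na.omit(DF[c("ID", "Col3", "Col4")]))

# Step 2: merge() joins on the only shared column, ID, so within each ID
# every row of one subset is paired with every row of the other
step  <- merge(merge(s1, s2), s3)

# Reduce() performs the same left-to-right sequence of merges
whole <- Reduce(merge, list(s1, s2, s3))

identical(step, whole)
nrow(whole)
```

Because ID is the only column the subsets have in common, each merge behaves as a Cartesian join within an ID group.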

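Edit 2:

With version 4.1.0, R gained a simple native forward pipe syntax |> and \() as a shorthand notation for function(). With these, the code from Edit 1 can be rewritten to use only base R (without magrittr):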

list("Col1", "Col2", c("Col3", "Col4")) |> 
  lapply(\(x) DF[c("ID", x)] |> na.omit() |> unique()) |>
  (\(z) Reduce(merge, z))()
