R: replace column values by group

Question

I am working with a dataset that contains duplicate rows, like this:

Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      0       NA       1
1    2004-11-29  1     2      1       0        0
2    2005-04-26  2     2      NA      1        0
3    2006-10-09  1     2      1       0        1
3    2006-10-09  1     2      0       0        NA

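What I would like to do is, for rows that share the same Id and the same Date, replace the value in each of Col1, Col2, and Col3 with 1 if any of those rows has a 1 in that column, otherwise with 0, or with NA if the value is missing in both rows.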

For example, Id 1 has two rows with the same date:

Col1 0, 1 - replace the value with 1
Col2 NA, 0 - replace the value with 0
Col3 1, 0 - replace the value with 1

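For two rows:

0, 1 is replaced with 1
NA, NA is replaced with NA
NA, 0 is replaced with 0
NA, 1 is replaced with 1
1, 1 is replaced with 1
0, 0 is replaced with 0

and so on.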

I am expecting a dataset like this:

Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      1       0        1
2    2005-04-26  2     2      NA      1        0
3    2006-10-09  1     2      1       0        1

Thanks in advance for any help.

Answer 1

library(data.table)

#convert NA to -999
DT[is.na(DT)] <- -999
#summarise, find maximum (if all original = NA, maximum will be -999)
ans <- DT[, lapply(.SD, max), by = .(Id, Date, x1, a1), .SDcols = patterns("^Col")]
#convert -999 back to NA
ans[ans == -999] <- NA
#    Id       Date x1 a1 Col1 Col2 Col3
# 1:  1 2004-11-29  1  2    1    0    1
# 2:  2 2005-04-26  2  2   NA    1    0
# 3:  3 2006-10-09  1  2    1    0    1

Sample data used

DT <- fread("Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      0       NA       1
1    2004-11-29  1     2      1       0        0
2    2005-04-26  2     2      NA      1        0
3    2006-10-09  1     2      1       0        1
3    2006-10-09  1     2      0       0        NA")
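As a possible variant of the approach above, the -999 sentinel can be avoided by taking the maximum with NAs ignored and returning NA only when every value in the group is missing. This is a minimal sketch, not part of the original answer: `max_or_na` is a hypothetical helper name, and the sample data is rebuilt by hand so the block runs on its own.

```r
library(data.table)

# Same sample data as above, built directly instead of via fread()
DT <- data.table(
  Id   = c(1L, 1L, 2L, 3L, 3L),
  Date = as.Date(c("2004-11-29", "2004-11-29", "2005-04-26",
                   "2006-10-09", "2006-10-09")),
  x1   = c(1L, 1L, 2L, 1L, 1L),
  a1   = c(2L, 2L, 2L, 2L, 2L),
  Col1 = c(0L, 1L, NA, 1L, 0L),
  Col2 = c(NA, 0L, 1L, 0L, 0L),
  Col3 = c(1L, 0L, 0L, 1L, NA)
)

# Hypothetical helper: maximum with NAs ignored.
# If every value is NA, v[1L] is an NA of the column's own type,
# which keeps the result type consistent across groups.
max_or_na <- function(v) if (all(is.na(v))) v[1L] else max(v, na.rm = TRUE)

ans <- DT[, lapply(.SD, max_or_na),
          by = .(Id, Date, x1, a1),
          .SDcols = c("Col1", "Col2", "Col3")]
print(ans)
```

Unlike the sentinel version, this leaves `DT` unmodified and does not depend on -999 never occurring in the data.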

Answer 2

If you are open to dplyr, you could use

library(dplyr)
df %>% 
  group_by(Id, Date, x1, a1) %>% 
  summarise(across(Col1:Col3, ~na_if(max(coalesce(.x, -1)), -1)),
            .groups = "drop")

This returns

# A tibble: 3 x 7
     Id Date          x1    a1  Col1  Col2  Col3
  <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 2004-11-29     1     2     1     0     1
2     2 2005-04-26     2     2    NA     1     0
3     3 2006-10-09     1     2     1     0     1

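The main idea here is to always select the maximum value per column and per group. This relies on the assumption that the values are 0, 1, or missing.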
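To see how the maximum reproduces the pairwise rules listed in the question, the expression used inside `across()` can be applied to each pair on its own. This is only an illustrative sketch: the names `collapse_pair` and `pairs` are made up for the example, and it relies on the same assumption that -1 never occurs as a real value.

```r
library(dplyr)

# Hypothetical wrapper around the expression used in the answer:
# replace NA with -1, take the maximum, then turn a -1 result back into NA
collapse_pair <- function(v) na_if(max(coalesce(v, -1)), -1)

# The six pairs listed in the question, in the same order
pairs <- list(
  c(0, 1),
  c(NA_real_, NA_real_),
  c(NA, 0),
  c(NA, 1),
  c(1, 1),
  c(0, 0)
)

result <- vapply(pairs, collapse_pair, numeric(1))
print(result)
```

A pair yields NA only when both entries are missing, because that is the only case in which the maximum after `coalesce()` is still -1.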
