R data.table: how to merge rows with an error check

I have the following data.table:

library(dplyr)
library(data.table)

my_data = data.frame(
  id = c(1, 1, 2, 2, 3),
  sample_number = c('d1', 'rr1', 'd2', 'rr2', 'd3'),
  res_1 = c('AA', NA, NA, 'GG', 'AG'),
  res_2 = c(NA, 'TT', 'CC', NA, 'TC'),
  res_3 = c('II', 'II', 'DD', 'ID', 'ID')
)
my_data <- my_data %>% as.data.table() ## convert to data table
> my_data
  id sample_number res_1 res_2 res_3
1  1            d1    AA   <NA>   II
2  1           rr1   <NA>   TT    II
3  2            d2   <NA>   CC    DD
4  2           rr2    GG   <NA>   ID
5  3            d3    AG    TC    ID

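The key column is id. Some ids have 2 rows, with different values in the sample_number column. How can I merge the rows by id? For id 2, the res_3 column holds two conflicting values (DD and ID), which is an error. In that case the merged value should be '---'. The expected result is: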

id  sample_number   res_1 res_2   res_3
1   d1, rr1         AA    TT      II
2   d2, rr2         GG    CC      '---'
3   d3              AG    TC      ID

Here is one option:

# Define custom function to collapse entries from columns `res_*`:
# return the single distinct non-NA value of a group, or "---" on a conflict
collapse <- function(x) {
    vals <- unique(x[!is.na(x)])
    if (length(vals) == 1) vals else "---"
}
library(tidyverse)
my_data %>%
    group_by(id) %>%
    summarise(
        sample_number = toString(sample_number),
        across(starts_with("res"), collapse),
        .groups = "drop")
## A tibble: 3 x 5
#     id sample_number res_1 res_2 res_3
#  <dbl> <chr>         <chr> <chr> <chr>
#1     1 d1, rr1       AA    TT    II   
#2     2 d2, rr2       GG    CC    ---  
#3     3 d3            AG    TC    ID   

Note that I assume the NAs in your data.frame are real NAs (not the string "NA"), as in

my_data = data.frame(
    id = c(1, 1, 2, 2, 3),
    sample_number = c('d1', 'rr1', 'd2', 'rr2', 'd3'),
    res_1 = c('AA', NA, NA, 'GG', 'AG'),
    res_2 = c(NA, 'TT', 'CC', NA, 'TC'),
    res_3 = c('II', 'II', 'DD', 'ID', 'ID')
)
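
If the missing entries were instead stored as the literal string "NA" (for example after reading a file), they would count as ordinary values and every group containing one would be flagged as a conflict. A minimal base R sketch for converting them first, assuming the affected columns all start with `res` (the `my_data_chr` object below is a made-up illustration, not your data):

```r
# Hypothetical input where missing values arrived as the string "NA"
my_data_chr <- data.frame(
    id = c(1, 1),
    sample_number = c("d1", "rr1"),
    res_1 = c("AA", "NA"),
    res_2 = c("NA", "TT"),
    stringsAsFactors = FALSE
)

# Pick the result columns by name and replace the string "NA" with a real NA
res_cols <- grep("^res", names(my_data_chr), value = TRUE)
my_data_chr[res_cols] <- lapply(my_data_chr[res_cols], function(col) {
    col[which(col == "NA")] <- NA_character_
    col
})
```

After this step `is.na()` recognizes the missing entries, so the collapsing code treats them as gaps rather than as competing values.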

A data.table approach:

my_data[, sample_number := paste0(sample_number, collapse = ", "), by = .(id)]
DT <- melt(my_data, id.vars = c("id", "sample_number"), na.rm = TRUE)
dcast(DT, id + sample_number ~ variable, value.var = "value", 
      fun.aggregate = function(x) ifelse(length(unique(x)) > 1, "---", x))
#    id sample_number res_1 res_2 res_3
# 1:  1       d1, rr1    AA    TT    II
# 2:  2       d2, rr2    GG    CC   ---
# 3:  3            d3    AG    TC    ID
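
Note that the `:=` step above changes the sample_number column of my_data by reference. If you would rather leave the input untouched, the same result can probably be had with a single grouped aggregation and no melt/dcast. This is only a sketch: `dt`, `collapse_res`, and `merged` are names I made up for illustration, and the data is rebuilt here so the snippet runs on its own.

```r
library(data.table)

# Same data as in the question, rebuilt so the example is self-contained
dt <- data.table(
    id = c(1, 1, 2, 2, 3),
    sample_number = c("d1", "rr1", "d2", "rr2", "d3"),
    res_1 = c("AA", NA, NA, "GG", "AG"),
    res_2 = c(NA, "TT", "CC", NA, "TC"),
    res_3 = c("II", "II", "DD", "ID", "ID")
)

# Hypothetical helper: the single distinct non-NA value, or "---" on a conflict
collapse_res <- function(x) {
    vals <- unique(x[!is.na(x)])
    if (length(vals) == 1) vals else "---"
}

# Aggregate per id: join the sample numbers, collapse each res_* column
res_cols <- grep("^res_", names(dt), value = TRUE)
merged <- dt[, c(list(sample_number = toString(sample_number)),
                 lapply(.SD, collapse_res)),
             by = id, .SDcols = res_cols]
```

One thing to be aware of: with this helper, a group whose values are all NA also comes out as "---", so add a separate branch if you need to tell "no data" apart from "conflict".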