Aggregating and summing columns across 1500 files by matching IDs in R (or bash)

I have 1,500 files in identical format (the .scount file format from PLINK2, https://www.cog-genomics.org/plink/2.0/formats#scount); a sample:

#IID    HOM_REF_CT  HOM_ALT_SNP_CT  HET_SNP_CT  DIPLOID_TRANSITION_CT   DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT   DIPLOID_SINGLETON_CT    HAP_REF_INCL_FEMALE_Y_CT    HAP_ALT_INCL_FEMALE_Y_CT    MISSING_INCL_FEMALE_Y_CT
LP5987245   10  0   6   53  0   52  0   67  70  32
LP098324    34  51  10  37  100 12  59  11  49  0
LP908325    0   45  39  54  68  48  51  58  31  2
LP0932325   7   72  0   2   92  64  13  52  0   100
LP08324 92  93  95  39  23  0   27  75  49  14
LP034252    85  46  10  69  20  8   80  81  94  23

In reality each file contains 80,000 IIDs and is roughly 1-10 MB in size. Every IID is unique and appears exactly once per file.

I want to produce a single file in which rows are matched by IID and each column's values are summed. The column names are identical across all files.

I have tried:

fnames <- list.files(pattern = "\\.scount$")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
x <- aggregate(. ~ IID, data = df_all, sum)

but this is really slow for this many files, and the leading # on the #IID column is hard to deal with.

Any help would be greatly appreciated.

A tidyverse solution:

# Demo data: simulate three files by copying the same data frame
df2 <- df
df3 <- df

df_list <- list(df, df2, df3)

df_all <- do.call(rbind, df_list)

library(dplyr)

df_all %>%
  group_by(IID) %>%
  summarise(across(everything(), sum))

A data.table solution:
df_list <- list(df,df2,df3)

df_all <- do.call(rbind, df_list)

library(data.table)

setDT(df_all)
df_all[, lapply(.SD, sum), by=IID]

To ignore the "#", see: Cannot read file with "#" and space using read.table or read.csv in R
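Outside of R, the leading # can also be stripped from the headers up front. A minimal sketch with sed, writing cleaned copies (the *.clean.scount names here are just an illustration):

```shell
# Strip the leading '#' from the header line of every .scount file,
# so read.table(header = TRUE) no longer treats the header row as a comment.
for f in ./*.scount; do
    sed '1s/^#//' "$f" > "${f%.scount}.clean.scount"
done
```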

When reading this many files, reading them in parallel may be faster:

library(data.table)
library(parallel)

# Get locations of files to read
fnames <- list.files(pattern = "\\.scount$", full.names = TRUE)
# Get column names from first file (assuming all files have the same number of columns)
cnames <- as.vector(t(data.table::fread(fnames[1], nrows = 1, header = FALSE)))

# Calculate the number of cores available (or hard code)
no_cores <- parallel::detectCores()
# Initiate cluster
cl <- parallel::makeCluster(no_cores)

# Read in files parallel, rowbind
DT <- data.table::rbindlist( 
  parallel::parLapply(cl, fnames, data.table::fread, skip = 1, header = FALSE),
  use.names = TRUE, fill = TRUE)
# Stop cluster
parallel::stopCluster(cl)

# Set column names
data.table::setnames(DT, old = names(DT), new = cnames)
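Since the question also allows bash: a sketch of a pure awk aggregation, assuming whitespace-delimited files with identical headers and the ID in the first column. The data rows in the output are tab-delimited; the header is echoed as-is.

```shell
# Sum every numeric column per IID across all .scount files in one awk pass.
awk '
FNR == 1 { header = $0; next }          # remember the header, skip it as data
{
    ids[$1]                              # record each IID as a key
    for (i = 2; i <= NF; i++)
        sum[$1, i] += $i                 # accumulate per-IID, per-column totals
    ncol = NF
}
END {
    print header
    for (id in ids) {                    # note: output order is unspecified
        line = id
        for (i = 2; i <= ncol; i++)
            line = line "\t" sum[id, i]
        print line
    }
}' ./*.scount > combined.scount
```

This streams each file once and keeps only one accumulator row per IID in memory, so it should scale to 1,500 files without building a 120-million-row intermediate table.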