Aggregating and summing columns across 1500 files by matching IDs in R (or bash)
I have 1500 files in the same format (the .scount file format from PLINK2, https://www.cog-genomics.org/plink/2.0/formats#scount), for example:
#IID HOM_REF_CT HOM_ALT_SNP_CT HET_SNP_CT DIPLOID_TRANSITION_CT DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT DIPLOID_SINGLETON_CT HAP_REF_INCL_FEMALE_Y_CT HAP_ALT_INCL_FEMALE_Y_CT MISSING_INCL_FEMALE_Y_CT
LP5987245 10 0 6 53 0 52 0 67 70 32
LP098324 34 51 10 37 100 12 59 11 49 0
LP908325 0 45 39 54 68 48 51 58 31 2
LP0932325 7 72 0 2 92 64 13 52 0 100
LP08324 92 93 95 39 23 0 27 75 49 14
LP034252 85 46 10 69 20 8 80 81 94 23
In reality each file has 80,000 IIDs and is roughly 1-10 MB. Every IID is unique and appears exactly once per file.
I want to create a single file, matched by IID, with each column value summed. The column names are the same across all files.
I have tried:
fnames <- list.files(pattern = "\\.scount$")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
x <- aggregate(. ~ IID, data = df_all, sum)
But this is really slow for this many files, and the "#" at the start of the #IID column is hard to deal with.
Any help would be greatly appreciated.
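For the bash route mentioned in the title, a single awk pass can accumulate the sums without holding all files in memory at once. A sketch, assuming whitespace-separated columns with the ID in column 1 and a header on line 1 of every file (the function name `sum_scount` is just for illustration):

```sh
# sum_scount: sum columns 2..N per ID (column 1) across all files given as
# arguments; prints one header line followed by one summed row per ID.
sum_scount() {
  awk '
    FNR == 1 { hdr = $0; next }              # keep one header, skip the rest
    { ids[$1]                                # remember every ID seen
      for (i = 2; i <= NF; i++) s[$1, i] += $i }
    END { print hdr
          for (id in ids) {
            line = id
            for (i = 2; i <= NF; i++) line = line OFS s[id, i]
            print line } }' "$@"
}
```

Called as `sum_scount *.scount > combined.scount`; the row order of the output is arbitrary.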
A tidyverse solution:
df2 <- df  # df2 and df3 stand in for further files read the same way
df3 <- df
df_list <- list(df, df2, df3)
df_all <- do.call(rbind, df_list)
library(dplyr)
df_all %>%
  group_by(IID) %>%
  summarise_all(sum)
A data.table solution:
df_list <- list(df, df2, df3)
df_all <- do.call(rbind, df_list)
library(data.table)
setDT(df_all)
df_all[, lapply(.SD, sum), by = IID]
To ignore the "#", see Cannot read file with "#" and space using read.table or read.csv in R.
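If preprocessing outside R is an option, the leading "#" can also be removed from each header up front, so read.table sees a plain IID column. A sketch that writes cleaned copies into a fixed/ directory (the function name `strip_header_hash` is illustrative):

```sh
# strip_header_hash: copy each given file into fixed/, deleting the leading
# '#' from its first line only (data rows are left untouched).
strip_header_hash() {
  mkdir -p fixed
  for f in "$@"; do
    sed '1s/^#//' "$f" > "fixed/$(basename "$f")"
  done
}
```

Called as `strip_header_hash *.scount`, after which the files in fixed/ can be read with `header = TRUE` and no comment-character workaround.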
When reading a large number of files, reading them in parallel may be faster:
library(data.table)
library(parallel)
# Get locations of files to read
fnames <- list.files(pattern = "\\.scount$", full.names = TRUE)
# Get column names from first file (assuming all files have the same number of columns)
cnames <- as.vector(t(data.table::fread(fnames[1], nrows = 1, header = FALSE)))
# Calculate the number of cores available (or hard code)
no_cores <- parallel::detectCores()
# Initiate cluster
cl <- parallel::makeCluster(no_cores)
# Read in files parallel, rowbind
DT <- data.table::rbindlist(
  parallel::parLapply(cl, fnames, data.table::fread, skip = 1, header = FALSE),
  use.names = TRUE, fill = TRUE)
# Stop cluster
parallel::stopCluster(cl)
# Set column names
data.table::setnames(DT, old = names(DT), new = cnames)