Randomly sample each data file in my list before rbind-ing them into a data frame using R

Question

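I need help with a task. I have a folder of 121 .txt files, each about 10 MB in size. The files have almost exactly the same columns/headers and different rows. I only found out late yesterday that the headers differ, probably because the machine that generates the .txt files uses a lot of special characters in the header, so funny business ensues when I read them in.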

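I want to read all the files in the folder and then combine them into one large file for downstream analysis. I now have two more problems: the files' size and their potentially inconsistent dimensions make my fread() code fail. First, I want to find a function that can read a large number of .txt files correctly. Second, after reading the files in, I want to randomly sample 20% of each file and then combine those 20% samples into a .csv file for downstream processing. I'm fairly new to R, so list operations have been conceptually challenging so far. Finally, rbind doesn't work because some files have inconsistent dimensions, so I used smartbind from gtools as a workaround. But, similar to the random sampling before building the big file, can I also subset each file to columns 1 through 131 as it is read in?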

Here is my code, which slowly reads in all the files and combines them into one big .csv. Please advise.

setwd("C:/Users/mli/Desktop/3S_DMSO")
library(gtools)
# Create list of text files (pattern is a regular expression, not a glob)
txt_files_ls = list.files(pattern="\\.txt$")
# Read the files in, assuming tab separator
txt_files_df <- lapply(txt_files_ls, function(x) {read.csv(file = x, header = T, sep ="\t")})
# Combine them
combined_df <- do.call("smartbind", lapply(txt_files_df, as.data.frame))

write.csv(combined_df,"3SDMSO_merged.csv",row.names = F)

Answer 1

...
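txt_files_df <- lapply(txt_files_ls, function(x) {
  # fread with fill=TRUE usually works. If not, go back to read.csv
  # sample(.N, k) draws k row numbers from all .N rows; sample(k) alone would only permute rows 1..k
  fread(file = x, header = TRUE, sep = "\t", fill = TRUE)[sample(.N, round(.2*.N))] # keep a random 20% of rows
})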
# rbindlist with use.names=T,fill=T usually works. if not, preprocess above or go back to smartbind
combined_df <- rbindlist(txt_files_df,use.names=T,fill=T)
## Keep only columns 1 - 131
# if you don't use fread, then convert to data.table so the column selection below works:
# setDT(combined_df)
combined_df = combined_df[,1:131]
...
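
A note on the row sampling, since it is easy to get subtly wrong: called with a single number k, sample() returns a permutation of 1:k, so indexing with sample(k) can only ever pick from the first k rows. Passing the population size and the sample size separately draws from the whole file. A minimal base R sketch; the toy data frame `df` and the seed are assumptions made purely for illustration:

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
df <- data.frame(id = 1:100)      # toy stand-in for the rows of one file

n_keep <- round(0.2 * nrow(df))   # 20% of the rows

# sample(k) permutes 1:k, so only the first k rows can ever be chosen
first_rows_only <- df[sample(n_keep), , drop = FALSE]

# sample(n, k) draws k distinct row numbers from all n rows
random_rows <- df[sample(nrow(df), n_keep), , drop = FALSE]

max(first_rows_only$id)  # never exceeds n_keep
nrow(random_rows)        # n_keep rows, drawn from the whole file
```

Inside a data.table the same idea is written as DT[sample(.N, k)].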

Need it faster? See

Answer 2

You could try the read and write functions from data.table. fread has a very cool auto-start feature that intelligently picks out the column and header information.

library(data.table)
setwd("C:/Users/mli/Desktop/3S_DMSO")
txt_files_ls = list.files(pattern="\\.txt$") # regular expression matching names that end in .txt
txt_files_df <- lapply(txt_files_ls, fread)
sampled_txt_files_df <- lapply(txt_files_df,function(x){
  x[sample(1:nrow(x), ceiling(nrow(x) * 0.2)),1:131]
  })
combined_df <- rbindlist(sampled_txt_files_df)
fwrite(combined_df,"3SDMSO_merged.csv",row.names = FALSE)
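
If some files turn out to have fewer than 131 columns, or slightly different headers, the fixed column subset or a plain rbindlist() call may fail. Below is a hedged sketch of one way to guard against that. It runs on two small temporary files so that it is self-contained; the helper name `read_sample`, its `frac` and `max_cols` arguments, and the toy files are assumptions for illustration, not part of the original code:

```r
library(data.table)

# Hypothetical helper: read one tab-separated file, keep a random fraction
# of its rows and at most the first `max_cols` columns.
read_sample <- function(path, frac = 0.2, max_cols = 131) {
  dt <- fread(path, sep = "\t", header = TRUE, fill = TRUE)
  keep_rows <- sample(nrow(dt), ceiling(nrow(dt) * frac))
  keep_cols <- seq_len(min(max_cols, ncol(dt)))  # never ask for more columns than exist
  dt[keep_rows, keep_cols, with = FALSE]
}

# Two toy files with different column sets (stand-ins for the real .txt files)
tmp <- tempfile()
dir.create(tmp)
fwrite(data.table(a = 1:10,  b = letters[1:10]), file.path(tmp, "f1.txt"), sep = "\t")
fwrite(data.table(a = 11:20, c = runif(10)),     file.path(tmp, "f2.txt"), sep = "\t")

files   <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
sampled <- lapply(files, read_sample)

# use.names = TRUE matches columns by name; fill = TRUE pads missing ones with NA
combined <- rbindlist(sampled, use.names = TRUE, fill = TRUE)
fwrite(combined, file.path(tmp, "merged.csv"))
```

Sampling inside the helper means each full file can be discarded as soon as its 20% has been kept, rather than holding all 121 complete tables in memory at once.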


r

file-processing