Loop over files: parse files and group them by identifier

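I would like to:

  1. Read the list of *.bed files from a directory
  2. For every .bed file in my folder, use the id=NAME information contained in each line, which is part of the fifth column of all *.bed files (e.g. HOX.bed and zinc.bed below)
  3. Use a separate lookup table to link the id value to a Family value (e.g. the lookup table below)
  4. Combine/concatenate all files of the same Family (e.g. HOX.bed and zinc.bed) into one .bed file
  5. Save the combined file under the name found in the Family column (e.g. cram-2.bed)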

Example:

Lines of the HOX.bed file:

ma  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

Lines of the zinc.bed file:

ma  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

Lookup table:

Name                        Family
HOX                         cram-2
zinc                        cram-2
fire                        sf.xr
fire                        ra.XS-2
...continues...

The output I am looking for:

File name = cram-2.bed

HOX.bed and zinc.bed are concatenated because they both belong to Family cram-2!

ma  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
ma  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

I started preparing a script structure, but I am struggling to make all files with the same Family end up in the same output file (probably a .bed file):

myFiles <- list.files(pattern = "\\.bed$")
for(i in myFiles){
  name <- read.table((i), header = FALSE, sep="\t", stringsAsFactors=FALSE, quote="")
  name <- name %>% top_n(1, "id")
  Family_filtering <-
    table %>% filter(
      Family %in% name)
  save(...????????...)
}

Thank you very much for your help!!!

Turn each activity into a function, then combine them. Isn't that simple?!?

library(fs)
library(tidyverse)

dfNameFamily = tibble(
  Name = c("HOX", "zinc", "fire", "fire2"),
  Family = c("cram-2", "cram-2", "sf.xr", "ra.XS-2"))

dir = "bedfile"

# The backslash must be doubled inside an R string literal
BedFile = function(dir) dir_ls(dir, regexp = "\\.bed$")

readTxt = function(FileName){
  lines = character()
  if(file_exists(FileName)){
    con = file(FileName, open = "r")
    lines = readLines(con)
    close(con)
  }
  lines
}

# Take NAME from the id=NAME field of the first line (assumes a non-empty file with one id)
GetName = function(l) str_match(l, "id=(.+);seq")[1,2]

SaveFile = function(l, name, dir){
  # group_walk passes the group's rows as l and a one-row tibble of keys as name
  con = file(file.path(dir, paste0(name$Family, ".bed")))
  writeLines(unlist(l$lines), con)
  close(con)
}

tibble(FileName = BedFile(dir)) %>%  # List all .bed file names
  mutate(
    lines = map(FileName, readTxt),  # Read all lines of each .bed file
    Name = map_chr(lines, GetName)) %>%  # Get the Name for each .bed file
  left_join(dfNameFamily, by="Name") %>%  # Attach the Family
  group_by(Family) %>%
  group_walk(SaveFile, dir)  # Save one file per Family
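
If you would rather try the idea without extra packages, here is a minimal base R sketch of the same steps. It is only an illustration under a few assumptions: each .bed file carries a single id (read from its first line), the lookup table has exactly one Family per Name, and the combined files go to a separate output directory so that a second run does not read them back in as input. The function name `combine_by_family` and the arguments `in_dir`, `out_dir` and `lookup` are made up for this example.

```r
# Combine .bed files by family, using only base R.
# Assumptions: one id per file, one Family per Name in `lookup`.
combine_by_family <- function(in_dir, out_dir, lookup) {
  files <- list.files(in_dir, pattern = "\\.bed$", full.names = TRUE)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  # Read every file as raw lines so rows are written back unchanged
  lines_by_file <- lapply(files, readLines)

  # Pull NAME out of the id=NAME field on the first line of each file
  ids <- vapply(lines_by_file, function(x) {
    if (length(x) == 0) return(NA_character_)
    m <- regmatches(x[1], regexec("id=([^;]+)", x[1]))[[1]]
    if (length(m) < 2) NA_character_ else m[2]
  }, character(1))

  # Map each id to its family; ids missing from the lookup become NA
  families <- lookup$Family[match(ids, lookup$Name)]

  # split() drops NA groups, so files without a family are skipped
  groups <- split(seq_along(files), families)
  for (fam in names(groups)) {
    combined <- unlist(lines_by_file[groups[[fam]]])
    writeLines(combined, file.path(out_dir, paste0(fam, ".bed")))
  }
  invisible(names(groups))
}

# Small demo in temporary directories, with one line per input file
in_dir <- file.path(tempdir(), "bed_in")
out_dir <- file.path(tempdir(), "bed_out")
dir.create(in_dir, showWarnings = FALSE)
writeLines("ma\treg\tout\tfim\tid=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05",
           file.path(in_dir, "HOX.bed"))
writeLines("ma\treg\tout\tfim\tid=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05",
           file.path(in_dir, "zinc.bed"))
lookup <- data.frame(Name = c("HOX", "zinc"), Family = c("cram-2", "cram-2"))
combine_by_family(in_dir, out_dir, lookup)
```

Files whose id is not in the lookup table are silently skipped here; if that should be an error in your case, check `families` for `NA` before writing.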