Loop on files - Parse files and group them by identifier
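I would like to:
- Read the list of *.bed files from a directory.
- For every .bed file in my folder, use the id=NAME information that appears in every row as part of the fifth column (e.g. HOX.bed and zinc.bed below).
- Use a separate lookup table to link the id value to a Family value (e.g. the lookup table below).
- Combine/concatenate all files that belong to the same Family (e.g. HOX.bed and zinc.bed) into a single .bed file.
- Save the combined file under the name taken from the Family column (e.g. cram-2.bed).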
Example:
HOX.bed file lines:
ma reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
zinc.bed file lines:
ma reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
Lookup table:
Name Family
HOX cram-2
zinc cram-2
fire sf.xr
fire ra.XS-2
...continues...
The output I am looking for:
File name = cram-2.bed
HOX.bed and zinc.bed are concatenated because both belong to Family cram-2!
ma reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
ma reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
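I started preparing a script structure, but I am struggling to set it up so that all files with the same Family end up in the same output file (possibly a .bed file):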
myFiles <- list.files(pattern = "\\.bed$")
for(i in myFiles){
  name <- read.table((i), header = FALSE, sep="\t", stringsAsFactors=FALSE, quote="")
  name <- name %>% top_n(1, "id")
  Family_filtering <-
    table %>% filter(
      Family %in% name)
  save(...????????...)
}
Thank you very much for your help!!!
Turn each activity into a function, then compose the functions. Isn't it simple?!?
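library(fs)
library(tidyverse)

# Lookup table linking each id (Name) to its Family
dfNameFamily = tibble(
  Name = c("HOX", "zinc", "fire", "fire2"),
  Family = c("cram-2", "cram-2", "sf.xr", "ra.XS-2"))

dir = "bedfile"

# List all .bed files in the directory (the backslash must be doubled in an R string)
BedFile = function(dir) dir_ls(dir, regexp = "\\.bed$")

# Read all lines of a file; returns an empty vector if the file does not exist
readTxt = function(FileName){
  lines = character()
  if(file_exists(FileName)){
    con = file(FileName, open = "r")
    lines = readLines(con)
    close(con)
  }
  lines
}

# Extract the id from the first line of a file; [^;]+ stops at the first semicolon
GetName = function(l) str_match(l, "id=([^;]+)")[1,2]

# Write all lines of one Family to <dir>/<Family>.bed
# (group_walk passes the group key as a one-row tibble, hence name$Family)
SaveFile = function(l, name, dir){
  con = file(paste0(dir, "/", name$Family, ".bed"), open = "w")
  writeLines(unlist(l$lines), con)
  close(con)
}

tibble(FileName = BedFile(dir)) %>%        # Read all bed file names
  mutate(
    lines = map(FileName, readTxt),         # Read all lines from every bed file
    Name = map_chr(lines, GetName)) %>%     # Get the Name for every bed file
  left_join(dfNameFamily, by="Name") %>%   # Join Family
  group_by(Family) %>%
  group_walk(SaveFile, dir)                 # Save one file per Family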
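If you would rather avoid the tidyverse dependency, the same idea can be sketched in base R. This is only a minimal sketch under a few assumptions that are not in the question: every row of a given file carries the same id (so only the first line is inspected), the lookup table is a data frame with columns Name and Family, and the merged files go to a separate output directory so they are not picked up as inputs on a second run. The helper names get_id and merge_by_family are made up for illustration.

```r
# Hypothetical helper: extract the id=... value from the first element of `line`.
# Returns NA if there is no line or no id= field.
get_id <- function(line) {
  if (length(line) == 0) return(NA_character_)
  m <- regmatches(line[1], regexec("id=([^;]+)", line[1]))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Hypothetical helper: merge all .bed files in `in_dir` into one file per Family
# inside `out_dir`. `lookup` is assumed to have columns Name and Family.
merge_by_family <- function(in_dir, out_dir, lookup) {
  files <- list.files(in_dir, pattern = "\\.bed$", full.names = TRUE)

  # One id per file, taken from its first line (assumption: ids do not vary within a file)
  ids <- vapply(files, function(f) get_id(readLines(f, n = 1)), character(1))

  # match() returns the first matching row, so a Name listed twice maps to its first Family
  family <- lookup$Family[match(ids, lookup$Name)]

  # Files whose id is missing from the lookup table are skipped
  keep <- !is.na(family)
  groups <- split(files[keep], family[keep])

  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (fam in names(groups)) {
    lines <- unlist(lapply(groups[[fam]], readLines), use.names = FALSE)
    writeLines(lines, file.path(out_dir, paste0(fam, ".bed")))
  }
  invisible(names(groups))
}
```

You would call it along the lines of merge_by_family("bedfile", "merged", dfNameFamily). Note that the lookup table in the question lists fire twice with two different families; with match() only the first one is used, so you may need to decide how such ids should be handled.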