Data frame to bed file conversion
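I have fairly large data frames in R that I need to convert to BED files. I use the code below for the df -> bed conversion, but it is very slow. How can I convert a data frame to BED in a faster, smarter way, in R or bash?
Below are the first few rows of an example data frame and of the corresponding BED file:
Data frame: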
7:115121211 7:115717553 7:115728606 7:115728881 7:115732922 7:115736195 7:115742884 7:115745446 7:115747757 7:115752949 7:115754451 7:115758839 7:115760815 7:115764258 7:115766049 7:115767796 7:115770659 7:115778018 7:115778916 7:115783939 7:115786469 7:115786614 7:115787054 7:115795892 7:115796254 7:115796568 7:115796577 7:115798414 7:115799403
15:101802122 15:101796748 15:101797565 15:101798070 15:101800680 15:101800810 15:101800817 15:101801307 15:101801525 15:101801924 15:101802122 15:101802957 15:101803999 15:101804286 15:101806680 15:101807291 15:101807374 15:101809243 15:101809473 15:101809583 15:101809747 15:101809846 15:101811404 15:101812357 15:101816568 NA:NA NA:NA NA:NA NA:NA
14:48092448 14:48076797 14:48077220 14:48078107 14:48088532 14:48092327 14:48092448 14:48096413 14:48096883 14:48099107 14:48104473 14:48104777 14:48107294 14:48108274 14:48111243 14:48115370 14:48122276 14:48134996 14:48135150 14:48142024 14:48143526 14:48144608 NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA
12:131528491 12:131516574 12:131516713 12:131516733 12:131516770 12:131516883 12:131517005 12:131517020 12:131517066 12:131517150 12:131517651 12:131517793 12:131519612 12:131520249 12:131520675 12:131521681 12:131521694 12:131522373 12:131522451 12:131523741 12:131524764 12:131526844 12:131526894 12:131528491 12:131528903 NA:NA NA:NA NA:NA NA:NA
2:36665932 2:36656809 2:36656951 2:36657905 2:36659235 2:36660367 2:36660476 2:36660581 2:36660989 2:36662473 2:36663238 2:36664571 2:36664898 2:36665052 2:36665273 2:36665548 2:36665932 2:36667413 2:36667876 2:36668395 2:36668846 2:36669071 2:36669645 2:36669670 NA:NA NA:NA NA:NA NA:NA NA:NA
9:22877714 9:22839400 9:22841425 9:22841518 9:22848811 9:22849299 9:22850177 9:22852729 9:22854439 9:22855915 9:22861588 9:22862018 9:22862481 9:22867193 9:22873872 9:22875745 9:22876877 9:22877714 9:22878225 9:22878914 9:22889291 9:22889400 9:22889518 9:22889619 9:22890108 9:22898970 9:22900997 NA:NA NA:NA
1:207123117 1:207117558 1:207118228 1:207123117 1:207141973 1:207141987 1:207142251 1:207142507 1:207143053 1:207143296 1:207143550 NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA
12:43892862 12:43843894 12:43855134 12:43863058 12:43869655 12:43871540 12:43874891 12:43881326 12:43886205 12:43892862 12:43893000 12:43893367 12:43897876 12:43898117 12:43900108 12:43900561 12:43904333 NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA
20:55462744 20:55453181 20:55461735 20:55461900 20:55462009 20:55462033 20:55462059 20:55462092 20:55462201 20:55462241 20:55462356 20:55462451 20:55462457 20:55462468 20:55462495 20:55462612 20:55462729 20:55462744 20:55462789 20:55462796 20:55462807 20:55462898 20:55462921 20:55462971 20:55464575 NA:NA NA:NA NA:NA NA:NA
13:111858911 13:111835700 13:111837099 13:111837719 13:111837911 13:111840850 13:111842053 13:111845195 13:111845231 13:111852468 13:111852692 13:111853267 13:111856600 13:111856756 13:111858582 13:111858911 13:111869432 13:111869734 13:111871992 13:111876200 13:111878282 13:111883434 NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA NA:NA
BED file:
chr7 115121211 115121212 1 1
chr7 115717552 115717553 2 1
chr7 115728605 115728606 3 1
chr7 115728880 115728881 4 1
chr7 115732921 115732922 5 1
chr7 115736194 115736195 6 1
chr7 115742883 115742884 7 1
chr7 115745445 115745446 8 1
chr7 115747756 115747757 9 1
chr7 115752948 115752949 10 1
R code:
df2bed = function(trait.regions, outDir) {
  # converts data.frames to bed files, one output file per input file
  for (i in dir(trait.regions, full.names = T)) {
    fileName = basename(i)
    # keep the "chr:pos" cells as character so strsplit() works on them
    tmp_df = read.table(i, stringsAsFactors = F)
    tmp_bed = data.frame(chr = character(),
                         str = character(),
                         end = character(),
                         id = character(),
                         set = character(),
                         stringsAsFactors = F)
    m = 1
    for (j in 1:nrow(tmp_df)){
      for (k in 1:ncol(tmp_df)){
        # growing tmp_bed one row at a time is what makes this slow
        tmp_bed[m,]$chr = paste0("chr", strsplit(tmp_df[j,k], split = ":")[[1]][1])
        tmp_bed[m,]$str = as.numeric(strsplit(tmp_df[j,k], split = ":")[[1]][2])-1
        tmp_bed[m,]$end = as.numeric(strsplit(tmp_df[j,k], split = ":")[[1]][2])
        tmp_bed[m,]$id = m
        tmp_bed[m,]$set = j
        m = m + 1
      }
    }
    # Clear NAs ("NA:NA" cells give NA for str and end)
    tmp_bed = na.omit(tmp_bed)
    write.table(tmp_bed, file = paste0(outDir, "/variant_beds/", fileName),
                quote = F, row.names = F, col.names = F, sep = "\t")
  }
}
Thanks!
I wrote a bash script for this; hopefully it will be faster for you.
# get lines
IFS=$'\n' read -d '' -r -a lines < input.txt
id=0 # to keep rowid
# loop through the lines
for i in "${!lines[@]}"
do
    # loop through the columns (left unquoted on purpose so the line splits on whitespace)
    for col in ${lines[i]}
    do
        # separate by colon
        CHR=$(echo "$col" | cut -f1 -d:)
        pos=$(echo "$col" | cut -f2 -d:)
        posi=$((pos-1))
        id=$((id+1))
        rownumber=$((i+1))
        # print one BED line
        printf 'chr%s\t%s\t%s\t%s\t%s\n' "$CHR" "$posi" "$pos" "$id" "$rownumber"
    done
done > output.txt   # truncate and write once, so reruns do not append to old output
# delete NAs
awk '!/NA/' output.txt > temp && mv temp output.txt
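What I basically do is read in the file containing the data frame (input.txt), then loop through the lines and take each column (col).
I then split the string on ":" into $CHR and $pos. Finally, each BED line is printed to the output file (output.txt) with: the chromosome, position - 1, the position, the running ID ($id), and the row it was extracted from ($rownumber). After the output file is created, I delete all lines containing NA.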
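The same steps can also be sketched as a single awk pass, which avoids starting `echo` and `cut` for every cell. This is only an illustration of the idea, not a benchmarked replacement: the function name `df2bed` and the two-row sample input are made up for the example, and it assumes cells are whitespace-separated `chr:pos` strings with `NA:NA` as padding.

```shell
#!/usr/bin/env bash
# df2bed (hypothetical name): turn whitespace-separated "chr:pos" cells
# into 5-column BED lines: chrom, pos-1, pos, running cell id, source row.
df2bed() {
    awk 'BEGIN { OFS = "\t" }
    {
        for (k = 1; k <= NF; k++) {
            id++                          # counts every cell, NA cells included
            split($k, part, ":")
            if (part[1] == "NA" || part[2] == "NA") continue
            print "chr" part[1], part[2] - 1, part[2], id, NR
        }
    }' "$1"
}

# Made-up sample: two rows, the second padded with NA:NA
printf '7:115717553 7:115728606\n15:101796748 NA:NA\n' > input.txt
df2bed input.txt > output.txt
cat output.txt
```

Because NA cells are skipped while the counter still advances, the IDs keep the same numbering as the loop version and no separate NA clean-up step is needed.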
I have now also tried it in R, replacing the data frame with a list. I also added an if() for the NAs to make it a bit faster as well.
for (i in dir(trait.regions, full.names = T)) {
  fileName = basename(i)
  # read from the full path i; fileName alone only works inside that directory
  tmp_df <- read.table(i, stringsAsFactors = F)
  tmp_bed <- list()
  m = 1
  for (j in 1:nrow(tmp_df)){
    for (k in 1:ncol(tmp_df)){
      # split each cell once and skip the "NA:NA" padding
      parts <- strsplit(tmp_df[j,k], split = ":")[[1]]
      if(parts[1] != "NA"){
        tmp_bed$chr[m] <- paste0("chr", parts[1])
        tmp_bed$str[m] <- as.numeric(parts[2])-1
        tmp_bed$end[m] <- as.numeric(parts[2])
        tmp_bed$id[m] <- m
        tmp_bed$set[m] <- j
      }
      m = m + 1
    }
  }
  tmp_bed <- do.call(cbind.data.frame, tmp_bed)
  # Clear NAs (skipped cells leave NA gaps in the vectors)
  tmp_bed = na.omit(tmp_bed)
  write.table(tmp_bed, file = paste0(outDir,"/variant_beds/", fileName),
              quote = F, row.names = F, col.names = F, sep = "\t")
}
When I compared the run times of the two scripts, the elapsed time was reduced by 1/4.
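For the bash route, the per-directory loop of the R versions (one BED file per input file, written under `variant_beds/`) could be sketched as below. The function name `beds_from_dir`, the temporary directory and the file `traitA.txt` are invented for the example; the awk program assumes the same `chr:pos` cell layout as the sample data.

```shell
#!/usr/bin/env bash
# beds_from_dir (hypothetical name): convert every regular file in in_dir
# and write the result to out_dir/variant_beds/<same file name>.
beds_from_dir() {
    local in_dir=$1 out_dir=$2 f
    mkdir -p "$out_dir/variant_beds"
    for f in "$in_dir"/*; do
        [ -f "$f" ] || continue           # skip subdirectories and an unmatched glob
        awk 'BEGIN { OFS = "\t" }
        {
            for (k = 1; k <= NF; k++) {
                id++                      # running id per file, NA cells included
                n = split($k, part, ":")
                if (n != 2 || part[1] == "NA" || part[2] == "NA") continue
                print "chr" part[1], part[2] - 1, part[2], id, NR
            }
        }' "$f" > "$out_dir/variant_beds/$(basename "$f")"
    done
}

# Throwaway example directory with one made-up input file
work=$(mktemp -d)
mkdir "$work/regions"
printf '1:207117558 NA:NA\n2:36656809 2:36656951\n' > "$work/regions/traitA.txt"
beds_from_dir "$work/regions" "$work/out"
cat "$work/out/variant_beds/traitA.txt"
```

Each file gets its own awk process, so the ID counter restarts at 1 per file, as `m = 1` does inside the R loop.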