Running an array on multiple files with different output files (Linux)
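I want to break 8 files (each representing a chromosome) into approximately 5 chunks of 4e8 lines; each file has approximately 2e9 lines. These are VCF files (https://en.wikipedia.org/wiki/Variant_Call_Format), which have a header followed by the genetic variants, so I need to keep each file's header and reattach it to every chunk from that chromosome. I am doing this on Linux on an HPC.
I have done this before with a single file using: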
#grab the header
head -n 10000 my.vcf | grep "^#" >header
#grab the non header lines
grep -v "^#" my.vcf >variants
#split into chunks with 40000000 lines
split -l 40000000 variants
#reattach the header to each and clean up
for i in x*;do cat header $i >$i.vcf && rm -f $i;done
rm -f header variants
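I could do this manually for all 8 chromosomes, but I work on an HPC with array functionality and feel this could be done better with a for loop; however, the syntax confuses me a little.
I have tried: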
#filelist is a list of the 8 chromosome files i.e. chr001.vcf, chr002.vcf...chr0008.vcf
for f in 'cat filelist.txt'; head -n 10000 my.vcf | grep "^#" >header; done
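This puts everything into the same header. How would I put the output into a unique header for each chromosome? Similarly, how would this then split the variants and reattach the header to each chunk for each chromosome?
The desired output would be: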
chr001_chunk1.vcf
chr001_chunk2.vcf
chr001_chunk3.vcf
chr001_chunk4.vcf
chr001_chunk5.vcf
...
chr008_chunk5.vcf
Each vcf chunk should have the header from its respective chromosome "parent".
Many thanks
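The loop attempted in the question fails for three reasons: the single quotes make 'cat filelist.txt' a literal string rather than a command substitution, the `do` keyword is missing, and the body reads my.vcf and writes to header on every pass instead of using $f. A minimal sketch of a corrected header-only loop follows. The function name extract_headers and the .header suffix are made up here for illustration, and the 10000-line limit is carried over from the question as an assumption about how long the header can be.

```shell
#!/bin/bash
# Write one header file per VCF named in a file list (one path per line).
# extract_headers is a hypothetical helper name, not part of any tool.
extract_headers() {
    local filelist=$1 f base
    # read the list line by line instead of word-splitting command output
    while IFS= read -r f; do
        [ -n "$f" ] || continue        # skip blank lines
        base=${f%.vcf}                 # chr001.vcf -> chr001
        # header lines start with '#'; assumes they all sit in the first 10000 lines
        head -n 10000 "$f" | grep "^#" > "${base}.header"
    done < "$filelist"
}
```

Calling `extract_headers filelist.txt` would leave chr001.header, chr002.header and so on next to the input files, so each chromosome keeps its own header.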
#!/bin/bash
#
# scan the current directory for chr[0-9]*.vcf
# extract header lines (^#)
# extract variants (non-header lines) and split to 40m partial files
# combine header with each partial file
#
# for tuning
lines=40000000
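# without nullglob an unmatched pattern stays in the array as a literal string
shopt -s nullglob
vcf_list=(chr[0-9]*.vcf)
# ${#vcf_list[@]} is the element count; ${#vcf_list} is only the length of element 0
if [ ${#vcf_list[@]} -eq 0 ]; then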
echo no .vcf files
exit 1
fi
tmpv=variants
hdr=header
for chrfile in "${vcf_list[@]}"; do
# isolate without . extn
base=${chrfile%%.*}
echo "$chrfile"
# extract header lines (assumes the header fits in the first 1000 lines)
head -n 1000 "$chrfile" | grep "^#" > "$hdr"
# extract variants
grep -v "^#" "$chrfile" > "$tmpv"
#
# split variants into files with max $lines;
# output files are created with a filter to combine header data and
# partial variant data in 1 pass, avoiding additional file I/O;
# output files are named with a leading 'p' to support multiple
# runs without filename collision
#
# \$FILE is escaped so that split, not this script, expands it per chunk
split -d -l $lines $tmpv p${base}_chunk --additional-suffix=.vcf \
--filter="cat $hdr - > \$FILE; echo \" \$FILE\""
done
rm -f $tmpv $hdr
exit 0
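As a variation on the script above, the variants can be piped straight into split, which avoids writing the intermediate variants file to disk. This is only a sketch under assumptions: it relies on GNU coreutils split (for --filter, --numeric-suffixes=FROM and --additional-suffix), the function name split_vcf and the HDR variable are made up for illustration, and the header is assumed to fit in the first 10000 lines. The -a 1 option gives the chr001_chunk1.vcf style names asked for in the question, but split will stop with an error if a file needs more than 9 chunks.

```shell
#!/bin/bash
# Split one VCF into header-prefixed chunks named <base>_chunk<N>.vcf.
# split_vcf is a hypothetical helper; requires GNU split.
split_vcf() {
    local vcf=$1 lines=${2:-40000000}
    local base hdr status
    base=$(basename "$vcf" .vcf)            # chr001.vcf -> chr001
    hdr=$(mktemp) || return 1               # private header file per call
    head -n 10000 "$vcf" | grep "^#" > "$hdr"
    # Stream the non-header lines into split; for every chunk split runs the
    # filter with $FILE set to the chunk name, and the filter prepends the header.
    grep -v "^#" "$vcf" |
        HDR="$hdr" split -l "$lines" --numeric-suffixes=1 -a 1 \
            --additional-suffix=.vcf \
            --filter='cat "$HDR" - > "$FILE"' - "${base}_chunk"
    status=$?
    rm -f "$hdr"
    return "$status"
}
```

A plain loop such as `for f in chr[0-9]*.vcf; do split_vcf "$f"; done` would process the files one after another. Because each call uses its own temporary header file, the function could also be run once per task of a scheduler job array, with each task picking one line of filelist.txt; under Slurm, for instance, that line could be selected with `sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt`, though the question does not say which scheduler is in use.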