Copy strings from several files and paste them into a new file in bash
I have several files containing FASTA data. All of them are in the same directory but have different names.
File 1
>gene1
AAAAAAAAAAAAAAAAAAAA
>gene2
GGGGGGGGGGGGGGGGGGGG
File 2
>gene1
CCCCCCCCCCCCCCCCCCCC
>gene2
TTTTTTTTTTTTTTTTTTTT
I want to create one new file per gene. The file name would be the gene name, and its content should look like this:
gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC
Could you please try the following. It was written and tested only with the samples provided.
awk '
/^>/{
sub(/^>/,"")
file=$0
print ">"FILENAME >> (file)
next
}
{
print >> (file)
close(file)
}
' file*
For the provided samples, it will create 2 output files named gene1 and gene2, as shown below.
cat gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC
cat gene2
>file1
GGGGGGGGGGGGGGGGGGGG
>file2
TTTTTTTTTTTTTTTTTTTT
Explanation: a commented walkthrough of the code above.
awk '                            ## Start the awk program.
/^>/{                            ## Check whether the line starts with >, as in the samples.
  sub(/^>/,"")                   ## Remove the leading > from the line.
  file=$0                        ## Set the variable file to the current line (the gene name).
  print ">"FILENAME >> (file)    ## Append > followed by the awk variable FILENAME to the output file named in file.
  next                           ## Skip the remaining rules for this line and move on to the next line.
}                                ## Close the block for the /^>/ condition.
{                                ## Start the block that runs on every input line apart from lines starting with >.
  print >> (file)                ## Append the current line to the output file named in file.
  close(file)                    ## Close the output file to avoid a "too many open files" error.
}                                ## Close the block mentioned above.
' file*                          ## Pass all the input files here.
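The per-line close(file) is safe but reopens the output file for every sequence line. As a minimal sketch, assuming sequences may span several lines and that any line appearing before the first header of a file should be ignored, the variant below closes the output only when the header changes. The function name split_by_gene is hypothetical and not part of the original answer.

```shell
#!/usr/bin/env bash
# split_by_gene is a hypothetical helper name used only for illustration.
# Usage: split_by_gene file1 file2 ...
split_by_gene() {
  awk '
    FNR == 1 {                       # first line of each input file
      if (out != "") close(out)      # finish the gene file left open by the previous input
      out = ""                       # assumption: ignore lines that precede the first header
    }
    /^>/ {
      if (out != "") close(out)      # finish the previous gene file before switching
      out = substr($0, 2)            # gene name = header line without the leading ">"
      print ">" FILENAME >> (out)    # record which input file this sequence came from
      next
    }
    out != "" { print >> (out) }     # sequence line: append to the current gene file
  ' "$@"
}
```

Because the output is opened in append mode, running it twice in the same directory appends the records a second time, so remove old gene files before rerunning.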
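For your question, there are a few assumptions:
- Each "gene" has a header that starts with >
- The header is followed by one line of content (or more)
- There may be more than 2 files and more than 2 genes
These are the conditions any program needs in order to detect the pattern and filter/split the data.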
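One way to check these assumptions before splitting is to count how often each header occurs across the inputs. This is only a sketch; count_gene_headers is a hypothetical name, and it assumes every header line starts with > in the first column.

```shell
#!/usr/bin/env bash
# count_gene_headers is a hypothetical helper name used only for illustration.
# Usage: count_gene_headers file1 file2 ...
count_gene_headers() {
  # -h suppresses file names so identical headers from different files compare equal
  grep -h '^>' "$@" | sort | uniq -c
}
```

If every gene is present exactly once per file, each count equals the number of input files; a lower count points to a gene that is missing somewhere.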
Pseudocode:
for file in folder
    for line in file
        if the line is a gene header, set target_file_name to the gene name
        if not, append current_file_name and current_line to target_file_name
Let me know whether this meets your requirements, or whether you need a further implementation or detailed code; either bash or awk should work.
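For reference, the pseudocode could be sketched in plain bash roughly as follows. The function name split_fasta_bash is hypothetical, and the sketch assumes gene names are usable as file names (no slashes). It writes the source file name once per record, at the header, which matches the output requested in the question.

```shell
#!/usr/bin/env bash
# split_fasta_bash is a hypothetical name; this is a sketch, not the original answer's code.
# Usage: split_fasta_bash file1 file2 ...
split_fasta_bash() {
  local input line target
  for input in "$@"; do
    target=""                                   # no gene seen yet in this input file
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
      if [[ $line == ">"* ]]; then
        target=${line#>}                        # gene name becomes the output file name
        printf '>%s\n' "$input" >> "$target"    # new record header is the source file name
      elif [[ -n $target ]]; then
        printf '%s\n' "$line" >> "$target"      # sequence line goes to the current gene file
      fi
    done < "$input"
  done
}
```

A shell read loop is usually much slower than awk on large FASTA files, so this is mainly useful for small inputs or for following the logic step by step.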