Copy strings from several files and paste them into a new file in bash

I have several files containing FASTA data. All of the files are in the same directory but have different names.

file1

>gene1
AAAAAAAAAAAAAAAAAAAA
>gene2
GGGGGGGGGGGGGGGGGGGG

file2

>gene1
CCCCCCCCCCCCCCCCCCCC
>gene2
TTTTTTTTTTTTTTTTTTTT

I want to create a new file for each gene. Each file should be named after the gene and look like this:

gene1

>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC

Could you please try the following? It was written and tested with the provided samples only.

awk '
/^>/{
  sub(/^>/,"")
  file=$0
  print ">"FILENAME >> (file)
  next
}
{
  print >> (file)
  close(file)
}
' file*

For the provided samples, it creates 2 output files named gene1 and gene2, as follows.

cat gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC

cat gene2
>file1
GGGGGGGGGGGGGGGGGGGG
>file2
TTTTTTTTTTTTTTTTTTTT
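
To check this yourself, here is a minimal sketch that rebuilds the two sample files in a throwaway directory and runs the same awk logic on them. The names file1 and file2 come from the question; the scratch directory created with mktemp is my own assumption so that nothing in your real data directory is touched.

```shell
#!/usr/bin/env bash
# Work in a scratch directory (an assumption, not part of the question).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample inputs from the question.
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

# Same logic as the awk program above, written compactly.
awk '
/^>/{ sub(/^>/,""); file=$0; print ">"FILENAME >> (file); next }
{ print >> (file); close(file) }
' file1 file2

# Show the per-gene files that were created.
cat gene1 gene2
```

Because the program appends with >>, running it a second time in the same directory adds the records again, so remove the old gene files before a rerun.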

Explanation: a line-by-line explanation of the code above.

awk '                              ##Start the awk program.
/^>/{                              ##If the line starts with > (a header line in the samples):
  sub(/^>/,"")                     ##Remove the leading > from the line.
  file=$0                          ##Store the rest of the line (the gene name) in the variable file.
  print ">"FILENAME >> (file)      ##Append > followed by the current input file name (awk variable FILENAME) to the output file named by file.
  next                             ##Skip the remaining statements and move on to the next input line.
}                                  ##End of the /^>/ block.
{                                  ##This block runs for every line that does not start with >.
  print >> (file)                  ##Append the current line to the output file named by file.
  close(file)                      ##Close the output file to avoid a "too many open files" error.
}                                  ##End of the block.
' file*                            ##Pass all input files.

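Your question requires a few assumptions:

  • Each "gene" has a header that starts with >
  • The header is followed by one or more lines of content
  • There may be more than 2 files and more than 2 genes

These are the conditions any program needs in order to detect the pattern and filter/split the data.

Pseudocode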

for file in folder
  for line in file
    if line is a gene header, set target_file_name to the gene name and append ">" + current_file_name to it
    if not, append current_line to target_file_name
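
The pseudocode above could be sketched in plain bash roughly as follows. This is only one possible reading of it: the function name split_by_gene is hypothetical, and it assumes gene names are safe to use as file names (no "/" in them) and that any directory part of the input path should be dropped from the new header.

```shell
#!/usr/bin/env bash
# split_by_gene is a hypothetical helper name, not something from the question.
split_by_gene() {
  local f line target
  for f in "$@"; do                                  # for file in folder
    target=""
    # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do    # for line in file
      if [[ $line == ">"* ]]; then
        target=${line#>}                             # gene name becomes the output file name
        printf '>%s\n' "${f##*/}" >> "$target"       # new header is the source file name
      elif [[ -n $target ]]; then
        printf '%s\n' "$line" >> "$target"           # content line goes to the current gene file
      fi                                             # lines before the first header are ignored
    done < "$f"
  done
}
```

Usage would be something like split_by_gene file* from inside the data directory. For large inputs the awk version is likely to be much faster, since this loop reads line by line in the shell and reopens the output file on every write.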

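Let me know whether this meets your requirements, or whether you need a fuller implementation or more detailed code; either bash or awk should work.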