Copy strings from several files and paste them into a new file in bash

Question

I have several files containing FASTA data. All of the files are in the same directory but have different names.

File 1

>gene1
AAAAAAAAAAAAAAAAAAAA
>gene2
GGGGGGGGGGGGGGGGGGGG

File 2

>gene1
CCCCCCCCCCCCCCCCCCCC
>gene2
TTTTTTTTTTTTTTTTTTTT

I want to create a new file for each gene. Each file would be named after the gene, and it should look like this:

gene1

>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC

Answer 1

Could you please try the following? It was written and tested only with the samples provided.

awk '
/^>/{
  sub(/^>/,"")
  file=$0
  print ">"FILENAME >> (file)
  next
}
{
  print >> (file)
  close(file)
}
' file*

For the samples provided, it creates 2 output files named gene1 and gene2, as follows.

cat gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC

cat gene2
>file1
GGGGGGGGGGGGGGGGGGGG
>file2
TTTTTTTTTTTTTTTTTTTT

Explanation: a line-by-line explanation of the code above.

awk '                              ##Start of the awk program.
/^>/{                              ##Check whether the line starts with >, as the headers in the samples do.
  sub(/^>/,"")                     ##Remove the leading > from the line.
  file=$0                          ##Set variable file to the rest of the line, i.e. the gene name.
  print ">"FILENAME >> (file)      ##Append > followed by the awk variable FILENAME (the current input file) to the output file named in file.
  next                             ##next skips the remaining statements for this line.
}                                  ##End of the BLOCK for the /^>/ condition.
{                                  ##This BLOCK runs for every line of the input files that does not start with >.
  print >> (file)                  ##Append the current line to the output file named in file.
  close(file)                      ##Close the output file to avoid a "too many open files" error.
}                                  ##End of the BLOCK described above.
' file*                            ##Pass all input files here.
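
As a possible variation, here is a minimal sketch that closes an output file only when the next header line arrives, instead of after every sequence line. It assumes the two sample inputs are named file1 and file2 (recreated here in a scratch directory so the example is self-contained), and the variable name out is just an illustration. Because it appends with >>, running it twice in the same directory would duplicate the records.

```shell
# Recreate the two sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

awk '
/^>/ {
  if (out != "") close(out)      # done with the previous gene file
  out = substr($0, 2)            # gene name: the header without the leading ">"
  print ">" FILENAME >> (out)    # record which input file the sequence came from
  next
}
out != "" { print >> (out) }     # sequence line(s) belonging to the current gene
' file1 file2

cat gene1
```

With the sample data this should produce the same gene1 and gene2 files as shown earlier; the only intended difference is fewer close() calls when a sequence spans many lines.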

Answer 2

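Your question implies a few assumptions:

Each "gene" has a header that starts with >
It is followed by one line of content (or more)
There may be more than 2 files and more than 2 genes

These are the conditions any program needs in order to detect the pattern and filter/split the data.

Pseudocode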

for file in folder
  for line in file
    if line is a gene header, set target_file_name to the gene name and append ">" + current_file_name to it
    if not, append current_line to target_file_name
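
The pseudocode above could be sketched in plain bash roughly as follows. This is only an illustration under a few assumptions: the inputs are the two sample files from the question, named file1 and file2 and recreated here in a scratch directory, and the variable names input, target and line are made up for the example. For large files the awk approach is likely to be much faster than a shell read loop.

```shell
# Recreate the two sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

for input in file1 file2; do
  target=""
  # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '>'*)
        # Header line: the gene name becomes the output file name,
        # and the input file name becomes the new header.
        target=${line#'>'}
        printf '>%s\n' "$input" >> "$target"
        ;;
      *)
        # Sequence line: append it to the current gene file, if any.
        if [ -n "$target" ]; then
          printf '%s\n' "$line" >> "$target"
        fi
        ;;
    esac
  done < "$input"
done

cat gene2
```

As with the awk version, the output files are opened in append mode, so existing gene files should be removed before running it again.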

Let me know whether this meets your requirements, or whether you need a fuller implementation or detailed code; either bash or awk should work.
