Issue with loop printing in Perl in file parsing

I have results from Phobius that look like this:

ID   sp|Q92673|1-2157
FT   SIGNAL        1     28       
FT   DOMAIN        1     11       N-REGION.
FT   DOMAIN       12     22       H-REGION.
FT   DOMAIN       23     28       C-REGION.
FT   DOMAIN       29   2135       NON CYTOPLASMIC.
FT   TRANSMEM   2136   2156       
FT   DOMAIN     2157   2157       CYTOPLASMIC.
//
---------------------------------------------------------------------
ID   sp|Q5SSG8|25-479
FT   DOMAIN        1    455       NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID   sp|Q92854|22-734
FT   DOMAIN        1    713       NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID   sp|Q9Y5E9|27-686
FT   DOMAIN        1    660       NON CYTOPLASMIC.
// 
---------------------------------------------------------------------
ID   sp|Q9Y6N8|55-613
FT   DOMAIN        1    559       NON CYTOPLASMIC.
//

For each result (results are separated by //), I want to print the corresponding UniProt ID in front of every line.

This is the Perl snippet I wrote:

open (MYFILE, "result_phobius.txt" )||warn "Couldn't open file because $!"; #give input file name
open (FILE, ">output.txt"); #output file name
while (<MYFILE>)
{
    if ($_=~/^ID   (\S+?)\s/) #match an ID line; the accession is captured up to the first whitespace
    {
        $id=$1;
        chomp ($id);
        print FILE "$id\t"; #will print accession number in a column
    }
    if ($_=~/^FT   /)
    {
        print FILE "$_";
    }
}

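This prints the ID only on the first line of each result, i.e. it works fine for results with a single domain but fails when there are multiple domains.

For example: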

FT   SIGNAL        1     28       
FT   DOMAIN        1     11       N-REGION.
FT   DOMAIN       12     22       H-REGION.
FT   DOMAIN       23     28       C-REGION.
FT   DOMAIN       29   2135       NON CYTOPLASMIC.
FT   TRANSMEM   2136   2156       
FT   DOMAIN     2157   2157       CYTOPLASMIC.
sp|Q5SSG8|25-479    FT   DOMAIN        1    455       NON CYTOPLASMIC.
sp|Q92854|22-734    FT   DOMAIN        1    713       NON CYTOPLASMIC.
sp|Q9Y5E9|27-686    FT   DOMAIN        1    660       NON CYTOPLASMIC.
sp|Q9Y6N8|55-613    FT   DOMAIN        1    559       NON CYTOPLASMIC.
sp|Q02763|23-748    FT   DOMAIN        1    726       NON CYTOPLASMIC.
sp|Q14517|22-4181   FT   DOMAIN        1   4160       NON CYTOPLASMIC.
sp|O75051|35-1237   FT   DOMAIN        1   1203       NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT   DOMAIN        1    119       CYTOPLASMIC.
FT   TRANSMEM    120    144       
FT   DOMAIN      145    145       NON CYTOPLASMIC.

How can I make it work for results with multiple entries?

Expected output:

sp|Q92673|1-2157    FT   SIGNAL        1     28       
sp|Q92673|1-2157    FT   DOMAIN        1     11       N-REGION.
sp|Q92673|1-2157    FT   DOMAIN       12     22       H-REGION.
sp|Q92673|1-2157    FT   DOMAIN       23     28       C-REGION.
sp|Q92673|1-2157    FT   DOMAIN       29   2135       NON CYTOPLASMIC.
sp|Q92673|1-2157    FT   TRANSMEM   2136   2156       
sp|Q92673|1-2157    FT   DOMAIN     2157   2157       CYTOPLASMIC.
sp|Q5SSG8|25-479    FT   DOMAIN        1    455       NON CYTOPLASMIC.
sp|Q92854|22-734    FT   DOMAIN        1    713       NON CYTOPLASMIC.
sp|Q9Y5E9|27-686    FT   DOMAIN        1    660       NON CYTOPLASMIC.
sp|Q9Y6N8|55-613    FT   DOMAIN        1    559       NON CYTOPLASMIC.
sp|Q02763|23-748    FT   DOMAIN        1    726       NON CYTOPLASMIC.
sp|Q14517|22-4181   FT   DOMAIN        1   4160       NON CYTOPLASMIC.
sp|O75051|35-1237   FT   DOMAIN        1   1203       NON CYTOPLASMIC.
tr|D3DPA4|1-145     FT   DOMAIN        1    119       CYTOPLASMIC.
tr|D3DPA4|1-145     FT   TRANSMEM    120    144       
tr|D3DPA4|1-145     FT   DOMAIN      145    145       NON CYTOPLASMIC.

Thanks in advance for your help.

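Just move the print FILE "$id\t" into the other if block, so that $id is only assigned when an ID line is seen but is printed for every domain line.

You could add a check before printing to make sure $id is not empty, but if I understand the format correctly, that should never happen.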

   if (/^ID   (\S+?)\s/)
   {
        $id = $1;   # remember the ID captured from the current ID line
   }
   if (/^FT   /)
   {
        print FILE "$id\t$_";
   }
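
For completeness, here is a rough sketch of how the whole thing might look as a self-contained script, including the optional check mentioned above. The sub name prefix_features and the in-memory sample are my own choices for illustration, not something from the question, and the sample is just a shortened copy of the Phobius output shown there. It also forgets the ID at each // line, which is an assumption about the format (every record starts with its own ID line).

```perl
use strict;
use warnings;

# Prefix every FT line with the ID of the record it belongs to.
# $in and $out are already-opened filehandles (hypothetical parameters
# chosen for this sketch).
sub prefix_features {
    my ($in, $out) = @_;
    my $id;                                # ID of the record being read
    while (my $line = <$in>) {
        if ($line =~ /^ID\s+(\S+)/) {
            $id = $1;                      # remember it for the FT lines that follow
        }
        elsif ($line =~ m{^//}) {
            undef $id;                     # end of record: forget the ID
        }
        elsif ($line =~ /^FT\s/) {
            if (defined $id) {
                print {$out} "$id\t$line"; # $line still carries its newline
            }
            else {
                warn "FT line $. has no preceding ID line\n";
            }
        }
    }
}

# Demo on a small in-memory sample shortened from the question.
my $sample = <<'END';
ID   sp|Q92673|1-2157
FT   SIGNAL        1     28
FT   DOMAIN        1     11       N-REGION.
//
---------------------------------------------------------------------
ID   sp|Q5SSG8|25-479
FT   DOMAIN        1    455       NON CYTOPLASMIC.
//
END

open my $in, '<', \$sample or die "Cannot read sample: $!";
my $result = '';
open my $out, '>', \$result or die "Cannot open in-memory output: $!";
prefix_features($in, $out);
close $out;
close $in;
print $result;
```

To run it on the real data, open result_phobius.txt for reading and output.txt for writing with the same three-argument open and pass those two handles to the sub instead of the in-memory ones.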