Issue with loop printing in Perl when parsing a file
I have results from Phobius that look like this:
ID sp|Q92673|1-2157
FT SIGNAL 1 28
FT DOMAIN 1 11 N-REGION.
FT DOMAIN 12 22 H-REGION.
FT DOMAIN 23 28 C-REGION.
FT DOMAIN 29 2135 NON CYTOPLASMIC.
FT TRANSMEM 2136 2156
FT DOMAIN 2157 2157 CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q5SSG8|25-479
FT DOMAIN 1 455 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q92854|22-734
FT DOMAIN 1 713 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q9Y5E9|27-686
FT DOMAIN 1 660 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q9Y6N8|55-613
FT DOMAIN 1 559 NON CYTOPLASMIC.
//
For each result (results are separated by //), I want to print the corresponding UniProt ID in front of every line.
This is the Perl snippet I wrote:
open (MYFILE, "result_phobius.txt" )||warn "Couldn't open file because $!"; #give input file name
open (FILE, ">output.txt"); #output file name
while (<MYFILE>)
{
if ($_=~/^ID (\S+?)\s/) #capture the ID: the text after "ID " up to the next whitespace
{
$id=$1;
chomp ($id);
print FILE "$id\t"; #will print accession number in a column
}
if ($_=~/^FT /)
{
print FILE "$_";
}
}
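This prints the ID only on the first line of each result, i.e. it works well for results with a single domain but fails when there are multiple domains.
For example: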
FT SIGNAL 1 28
FT DOMAIN 1 11 N-REGION.
FT DOMAIN 12 22 H-REGION.
FT DOMAIN 23 28 C-REGION.
FT DOMAIN 29 2135 NON CYTOPLASMIC.
FT TRANSMEM 2136 2156
FT DOMAIN 2157 2157 CYTOPLASMIC.
sp|Q5SSG8|25-479 FT DOMAIN 1 455 NON CYTOPLASMIC.
sp|Q92854|22-734 FT DOMAIN 1 713 NON CYTOPLASMIC.
sp|Q9Y5E9|27-686 FT DOMAIN 1 660 NON CYTOPLASMIC.
sp|Q9Y6N8|55-613 FT DOMAIN 1 559 NON CYTOPLASMIC.
sp|Q02763|23-748 FT DOMAIN 1 726 NON CYTOPLASMIC.
sp|Q14517|22-4181 FT DOMAIN 1 4160 NON CYTOPLASMIC.
sp|O75051|35-1237 FT DOMAIN 1 1203 NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT DOMAIN 1 119 CYTOPLASMIC.
FT TRANSMEM 120 144
FT DOMAIN 145 145 NON CYTOPLASMIC.
How can I make it work for results with multiple entries?
Expected output:
sp|Q92673|1-2157 FT SIGNAL 1 28
sp|Q92673|1-2157 FT DOMAIN 1 11 N-REGION.
sp|Q92673|1-2157 FT DOMAIN 12 22 H-REGION.
sp|Q92673|1-2157 FT DOMAIN 23 28 C-REGION.
sp|Q92673|1-2157 FT DOMAIN 29 2135 NON CYTOPLASMIC.
sp|Q92673|1-2157 FT TRANSMEM 2136 2156
sp|Q92673|1-2157 FT DOMAIN 2157 2157 CYTOPLASMIC.
sp|Q5SSG8|25-479 FT DOMAIN 1 455 NON CYTOPLASMIC.
sp|Q92854|22-734 FT DOMAIN 1 713 NON CYTOPLASMIC.
sp|Q9Y5E9|27-686 FT DOMAIN 1 660 NON CYTOPLASMIC.
sp|Q9Y6N8|55-613 FT DOMAIN 1 559 NON CYTOPLASMIC.
sp|Q02763|23-748 FT DOMAIN 1 726 NON CYTOPLASMIC.
sp|Q14517|22-4181 FT DOMAIN 1 4160 NON CYTOPLASMIC.
sp|O75051|35-1237 FT DOMAIN 1 1203 NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT DOMAIN 1 119 CYTOPLASMIC.
tr|D3DPA4|1-145 FT TRANSMEM 120 144
tr|D3DPA4|1-145 FT DOMAIN 145 145 NON CYTOPLASMIC.
Thanks in advance for your help.
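Just move the print FILE "$id\t" into the other if block, i.e. only populate $id when an ID line is seen, and print it for every domain.
You could add a check before printing $id to make sure it is not empty, but if I understand the format correctly, that should never happen.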
if (/^ID (\S+?)\s/)
{
$id = $1;
}
if (/^FT /)
{
print FILE "$id\t$_";
}
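For reference, here is a minimal self-contained sketch of the same idea with the optional emptiness check included. The helper name prefix_features, the reset of $id on a // line, and the shortened sample data are illustrative assumptions, not part of the original post, and the sketch works on a list of lines rather than on the MYFILE and FILE handles used above.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the post):
# takes Phobius-style lines and returns every FT line prefixed with
# the ID of the record it belongs to, separated by a tab.
sub prefix_features {
    my @lines = @_;
    my $id;     # ID of the record currently being read
    my @out;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^ID\s+(\S+)/) {
            $id = $1;                  # remember the ID, print nothing yet
        }
        elsif ($line =~ m{^//}) {
            undef $id;                 # assumption: // ends the record
        }
        elsif ($line =~ /^FT\s/) {
            # The optional check from the answer: skip FT lines with no ID.
            unless (defined $id) {
                warn "FT line without a preceding ID, skipped: $line\n";
                next;
            }
            push @out, "$id\t$line";
        }
    }
    return @out;
}

# Shortened sample in the same layout as the Phobius output above.
my $sample = <<'END';
ID sp|Q92673|1-2157
FT SIGNAL 1 28
FT TRANSMEM 2136 2156
//
ID sp|Q5SSG8|25-479
FT DOMAIN 1 455 NON CYTOPLASMIC.
//
END

print "$_\n" for prefix_features(split /\n/, $sample);
```

To run it on the real file, the lines read from result_phobius.txt can be passed to the helper instead of the sample, and the returned lines printed to the output handle.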