How to separate each BLAST result using Python regex and store it in a list for further analysis

Question

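I am working on a set of biological sequences using ncbi-blast, and I need some help processing the output file with Python regex. The text output, which contains multiple results (one sequence analysis result per query), looks like this: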

Query= lcl|TRINITY_DN2888_c0_g2_i1

Length=1394 Score E Sequences producing significant alignments:
(Bits) Value

sp|Q9S775|PKL_ARATH

CHD3-type chromatin-remodeling factor PICKLE... 1640 0.0

sp|Q9S775|PKL_ARATH CHD3-type chromatin-remodeling factor PICKLE OS=Arabidopsis thaliana OX=3702 GN=PKL PE=1 SV=1 Length=1384

Score = 1640 bits (4248), Expect = 0.0, Method: Compositional matrix adjust. Identities = 830/1348 (62%), Positives = 1036/1348 (77%), Gaps = 53/1348 (4%)

Query 1
MSSLVERLRVRSERRPLYTDDDSDDDLYAARGGSESKQEERPPERIVRDDAKNDTCKTCG 60 MSSLVERLR+RS+R+P+Y DDSDDD + + +Q E IVR DAK + C+ CG Sbjct 1
MSSLVERLRIRSDRKPVYNLDDSDDDDFVPKKDRTFEQ----VEAIVRTDAKENACQACG 56

Lambda K H a alpha 0.317 0.134 0.389 0.792 4.96

Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6

Effective search space used: 160862965056

Query= lcl|TRINITY_DN2855_c0_g1_i1

Length=145 ........................................ ................................................... ...................................................

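I want to extract everything from "Query= lcl|TRINITY_DN2888_c0_g2_i1" up to the next query, "Query= lcl|TRINITY_DN2855_c0_g1_i1", and store it in a Python list for further analysis (the whole file contains a few thousand query results). Is there a Python regex that can do this?

Here is my code: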

#!/usr/bin/python3
file=open("path/file_name","r+")
import re
inter=file.read()
lst=[]
lst=re.findall(r'>(.*)>',inter,re.DOTALL)
print(lst)
for x in lst:
    print(x)

The output is wrong: the code prints everything in the file (thousands of results) at once instead of extracting one result at a time.

Thanks

Answer 1

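To get the result you want, use re.split().

Change the line that calls re.findall() to the following:

lst=re.split(r'^(?=Query= )',inter,flags=re.MULTILINE)

Pass the flags by keyword, because the third positional argument of re.split() is maxsplit, not flags. Splitting on a zero-width pattern like this one requires Python 3.7 or later, and the first list element is whatever precedes the first "Query= " line.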

For more information on re.split(), see:

https://docs.python.org/2/library/re.html
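
As a minimal sketch of the splitting idea (one way to do it, not the only one): the snippet below assumes every record begins on a line starting with "Query= ", as in the sample in the question, and splits on a zero-width lookahead so that the "Query=" line stays with its record. The names split_blast_records and sample are illustrative, the sample text is a shortened stand-in for a real report, and splitting on an empty match needs Python 3.7 or later.

```python
import re

def split_blast_records(text):
    """Split a plain-text BLAST report into one string per query.

    Assumption: each record starts on a line beginning with 'Query= '.
    """
    # Zero-width split: cut right before every line that starts with 'Query= '
    chunks = re.split(r'^(?=Query= )', text, flags=re.MULTILINE)
    # Drop anything before the first 'Query= ' line (e.g. the report header)
    return [c for c in chunks if c.startswith('Query= ')]

# Shortened, illustrative stand-in for a real report
sample = (
    "header text before the first query\n\n"
    "Query= lcl|TRINITY_DN2888_c0_g2_i1\n\nLength=1394\n\n"
    "Effective search space used: 160862965056\n\n"
    "Query= lcl|TRINITY_DN2855_c0_g1_i1\n\nLength=145\n"
)

records = split_blast_records(sample)
print(len(records))                   # number of query records found
print(records[1].splitlines()[0])     # first line of the second record
```

Because the split point is a lookahead rather than a consumed match, nothing is removed from the text, so each list element can be searched again for scores, E-values, and so on.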

Also, you may want to consider using the now-deprecated plain-text BLAST parser in Biopython:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

The plain text BLAST parser is located in Bio.Blast.NCBIStandalone.

As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the readline() method and do this properly. The common ways to get such a handle are to either use the provided blastall or blastpgp functions to run the local blast, or to run a local blast via the command line, and then do something like the following:

result_handle = open("my_file_of_blast_output.txt")

Well, now that we've got a handle (which we'll call result_handle), we are ready to parse it. This can be done with the following code:

>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)

This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the information from it. In our case, let’s just print out a quick summary of all of the alignments greater than some threshold value.

>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments: 
...     for hsp in alignment.hsps: 
...         if hsp.expect < E_VALUE_THRESH: 
...             print('****Alignment****') 
...             print('sequence:', alignment.title) 
...             print('length:', alignment.length)
...             print('e value:', hsp.expect) 
...             print(hsp.query[0:75] + '...') 
...             print(hsp.match[0:75] + '...') 
...             print(hsp.sbjct[0:75] + '...')

If you also read the section 7.3 on parsing BLAST XML output, you’ll notice that the above code is identical to what is found in that section. Once you parse something into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!

Answer 2

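I finally found a solution for splitting the large file into small chunks, so that I can process each query result individually with Python regex. Here is my code: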

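#!/usr/bin/python3
import re

# Read the whole BLAST report into memory (read-only access is enough)
with open("/path/file_name.txt") as file:
    inter = file.read()

# Lazily capture the text between each "Query= lcl" and the next
# "Effective search space"; re.S lets "." match newlines
lst = re.findall(r'(?<=Query= lcl)(.*?)(?=Effective search space)', inter, flags=re.S)
print(lst)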

Thank you all for helping me...
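
For the "further analysis" step, one possible follow-up is to pull a few fields out of each chunk while iterating. The sketch below uses re.finditer with the same lazy-match idea and assumes the layout shown in the question: a "Query= " line, then "Score = ... bits (...), Expect = ..." summary lines, then an "Effective search space" line closing each record. The names iter_query_blocks, BLOCK_RE and SCORE_RE are hypothetical, and the sample text is a shortened stand-in for a real report.

```python
import re

# Lazy match from each 'Query= <id>' up to the next 'Effective search space'
BLOCK_RE = re.compile(r'Query= (\S+)(.*?)Effective search space', re.S)
# HSP summary as printed in the sample: 'Score = 1640 bits (4248), Expect = 0.0, ...'
SCORE_RE = re.compile(
    r'Score = (\d+(?:\.\d+)?) bits \(\d+\),\s*Expect\s*=\s*([^\s,]+)'
)

def iter_query_blocks(text):
    """Yield (query_id, [(bit_score, e_value_text), ...]) for each record."""
    for match in BLOCK_RE.finditer(text):
        query_id, body = match.group(1), match.group(2)
        # E-values are kept as text so unusual notations are not lost
        hsps = [(float(score), evalue) for score, evalue in SCORE_RE.findall(body)]
        yield query_id, hsps

# Shortened, illustrative stand-in for a real report
sample = (
    "Query= lcl|TRINITY_DN2888_c0_g2_i1\n\nLength=1394\n\n"
    "Score = 1640 bits (4248), Expect = 0.0, Method: Compositional matrix adjust.\n\n"
    "Effective search space used: 160862965056\n\n"
    "Query= lcl|TRINITY_DN2855_c0_g1_i1\n\nLength=145\n\n"
    "Effective search space used: 160862965056\n"
)

results = list(iter_query_blocks(sample))
for query_id, hsps in results:
    print(query_id, hsps)
```

Iterating with finditer avoids building one large list of strings up front, which may matter with thousands of queries. Note that a record lacking its closing "Effective search space" line would make the lazy match run on into the next record, so it is worth checking the number of blocks against the number of "Query= " lines.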

python

regex

bioinformatics

biopython