Find a pattern match in a string in Python

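I am trying to find an amino acid pattern (B-C or M-D, where "-" can be any letter except 'P') in a protein sequence, for example 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'. The protein sequences are stored in a FASTA file.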

I have tried many approaches but could not find a solution. The code below is one of my attempts:

import Bio
from Bio import SeqIO

seqs= SeqIO.parse(X, 'fasta') ### to read the sequences from fasta file
for aa in seqs:
    x=aa.seq ## gives the sequences as a string (.seq is a build in function of Biopython)
    
    for val, i in enumerate(x):          
        
        if i=='B':    
            if (x[val+2])=='C':
                
                if x[val+1]!='P':
                   pattern=((x[val]:x[val+2])) ## trying to print full sequence B-C
                

Unfortunately, none of them worked. It would be great if someone could help me solve this.

In Python you can use the regular expression module (re):

import re      # import the RE module
import Bio
from Bio import SeqIO

seqs = SeqIO.parse(X, 'fasta')
for sequence in seqs:
    line = str(sequence.seq)  # convert the Seq object to a plain string

    RE = r'B[A-OQ-Z]C|M[A-OQ-Z]D'
    # [A-OQ-Z] : match any letter from A to O or from Q to Z (i.e. excluding P)
    # | : the OR operator; either the left or the right part must match
    # The r prefix makes this a raw string, so backslashes are not treated as escapes

    results = re.findall(RE, line)
    # The function findall will return a list of all non-overlapping matches.

    # To iterate over each result :
    for res in results:
        print(res)

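You can then adapt the regular expression to match any other rule you need.

For more information on the findall function, see the Python documentation: re.findall(...)

The following website can help you build regular expressions: https://regex101.com/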
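One modification that may be worth knowing about: `re.findall` skips overlapping hits, so in a string like `BMCD` only `BMC` is reported even though `MCD` also fits the rule. If overlapping motifs matter for your analysis (that is an assumption on my part), a minimal sketch is to wrap the pattern in a lookahead. The helper name `find_overlapping` is made up for illustration.

```python
import re

# Same rule as above, wrapped in a zero-width lookahead so the regex engine
# advances one character at a time and can report overlapping hits.
OVERLAPPING = re.compile(r'(?=(B[A-OQ-Z]C|M[A-OQ-Z]D))')

def find_overlapping(sequence):
    """Return (start_index, motif) for every hit, including overlapping ones."""
    return [(m.start(), m.group(1)) for m in OVERLAPPING.finditer(sequence)]

print(find_overlapping('BMCD'))                       # overlapping hits are kept
print(re.findall(r'B[A-OQ-Z]C|M[A-OQ-Z]D', 'BMCD'))   # plain findall keeps only the first
```

For the sequence in the question the two approaches find the same motifs, because `BAC` and `MLD` do not overlap there.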

>>> x = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
>>> import re
>>> m = re.search('B(.+?)C', x)
>>> m
<_sre.SRE_Match object at 0x10262aeb0>
>>> m = re.search('B(.+?)C', x).group(0)
>>> m
'BAC'
>>> m = re.search('M(.+?)D', x).group(0)
>>> m
'MLD'
>>> re.search(r"(?<=M).*?(?=D)", x).group(0)
'L'
>>> re.search(r"(?<=B).*?(?=C)", x).group(0)
'A'
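Note that `.+?` accepts any number of characters between the two letters, including `P`, so these patterns are looser than the rule in the question. A sketch of a stricter variant, assuming exactly one non-`P` residue is wanted between the flanking letters:

```python
import re

x = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'

# 'B(.+?)C' would also accept 'BPC' or 'BAAAC'; [^P] limits the gap
# to exactly one character that is not 'P'.
strict = re.compile(r'(?<=B)[^P](?=C)|(?<=M)[^P](?=D)')

# With no capture groups, findall returns the matched middle residues.
middle_residues = strict.findall(x)
print(middle_residues)

# Compare the loose and the strict pattern on a string that breaks the rule.
loose_hit = re.search('B(.+?)C', 'BPC')
strict_hit = re.search(r'B[^P]C', 'BPC')
print(loose_hit.group(0), strict_hit)
```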

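The usual solution for pattern matching is to use regular expressions.

A possible regular expression for your problem is B[^P]C|M[^P]D

The following code was generated by regex101 from the regex I suggested and the test string you provided. It finds every match of the pattern together with its position in the original string.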

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"B[^P]C|M[^P]D"

test_str = "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
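One caveat about `[^P]`: it matches any character other than `P`, not just letters, so gap or stop symbols such as `-` or `*` in a sequence would also count as the middle position. Whether that matters depends on your data. The short comparison below uses a made-up sample string to show the difference from a letters-only class such as `[A-OQ-Z]`.

```python
import re

permissive = re.compile(r'B[^P]C|M[^P]D')              # anything except 'P'
letters_only = re.compile(r'B[A-OQ-Z]C|M[A-OQ-Z]D')    # uppercase letters except 'P'

# Hypothetical sample containing a gap ('-') and a stop symbol ('*').
sample = 'B-CMLDB*CBPC'

print(permissive.findall(sample))
print(letters_only.findall(sample))
```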

Use a regular expression with the negation character "^" inside a character class:

import re

string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
re.findall(r"B[^P]C|M[^P]D", string)

Output:

['BAC', 'MLD']
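For comparison, the index-based loop from the question can also be made to work without regex. This is only a sketch: the function name `find_motifs` is invented here, and it assumes the sequence is a plain string (for a Biopython record, something like `str(record.seq)`). The two fixes are stopping the loop two characters before the end, so the lookahead index never runs past the string, and using a slice to extract the motif.

```python
def find_motifs(sequence):
    """Scan every 3-letter window for B-C or M-D with a non-'P' middle residue."""
    hits = []
    # Stop two characters before the end so sequence[i + 2] is always valid.
    for i in range(len(sequence) - 2):
        first, middle, last = sequence[i], sequence[i + 1], sequence[i + 2]
        if (first, last) in {('B', 'C'), ('M', 'D')} and middle != 'P':
            hits.append((i, sequence[i:i + 3]))  # (start index, 3-letter motif)
    return hits

print(find_motifs('VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'))
```

Unlike `re.findall`, this window scan also reports overlapping motifs, since every start position is checked independently.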