Locating a pattern in a protein sequence with Biopython

Question

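I am trying to find sequences that contain a tripeptide. The tripeptide may be followed by any other amino acid except 'P'. I currently extract them as follows.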

from Bio import SeqIO
RGD = [] 
for record in SeqIO.parse("input.fasta", "fasta"):
    rgd_count = record.seq.count('RGD')
    if rgd_count >= 1:
        RGD.append(record) 
SeqIO.write(RGD, "RGD_Proteins.fasta", "fasta")

How can I introduce a regular expression here so that RGD(N) is accepted but RGDP is excluded?

Thanks in advance.

AP

Answer 1

You can use re.findall to find all non-overlapping regex matches in str(record.seq). Replace record.seq.count('RGD') with

len(re.findall(r"RGD(?!P)", str(record.seq)))

Also, be sure to add import re.
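Putting the pieces together, the question's loop might look roughly like the sketch below. The helper names count_rgd_not_p and filter_fasta are made up for illustration, and the Biopython import sits inside filter_fasta only so that the counting helper can be tried without Biopython installed.

```python
import re

# Precompile the pattern: "RGD" not immediately followed by "P".
RGD_NOT_P = re.compile(r"RGD(?!P)")


def count_rgd_not_p(sequence):
    """Count non-overlapping RGD motifs that are not followed by P."""
    return len(RGD_NOT_P.findall(str(sequence)))


def filter_fasta(in_path, out_path):
    """Write records with at least one qualifying RGD motif to out_path.

    Hypothetical helper; the paths are supplied by the caller.
    """
    from Bio import SeqIO  # requires Biopython

    kept = [
        record
        for record in SeqIO.parse(in_path, "fasta")
        if count_rgd_not_p(record.seq) >= 1
    ]
    SeqIO.write(kept, out_path, "fasta")
    return len(kept)
```

With the file names from the question, the call would be filter_fasta("input.fasta", "RGD_Proteins.fasta").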

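The RGD(?!P) pattern matches an RGD substring that is not followed by P. (?!P) is called a negative lookahead: it makes the match fail if its pattern is found immediately to the right of the current position.

See the Regular-Expressions.info "Lookarounds" section.

See the regex demo.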
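One detail worth checking is what happens when RGD sits at the very end of a sequence. The short comparison below, using made-up sample strings, contrasts the lookahead with the character class RGD[^P], which requires an actual following residue. Which behaviour is wanted depends on whether a C-terminal RGD should count as a hit.

```python
import re

lookahead = re.compile(r"RGD(?!P)")  # asserts the next character is not P; consumes nothing
consuming = re.compile(r"RGD[^P]")   # requires one more character that is not P

# Made-up sequences: RGD followed by S, by P, and by nothing.
samples = ["MKRGDS", "MKRGDP", "MKRGD"]

# Map each sample to (lookahead matched?, character class matched?).
results = {
    s: (bool(lookahead.search(s)), bool(consuming.search(s)))
    for s in samples
}

for seq, (by_lookahead, by_class) in results.items():
    print(seq, by_lookahead, by_class)
```

The lookahead still matches "MKRGD" because nothing follows the motif, whereas RGD[^P] does not. The lookahead's match text is also just "RGD", while the character class includes the fourth residue in the match.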

pattern-matching

biopython

python-3.x