What is the syntax for a region of unspecified length/characters when using findall in Python re?

Question

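I am trying to use regular expressions to capture protein names and their corresponding amino acid sequences from some data I have. Here is a stripped-down version of my code: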

import re
line=">sp|A0A385XJ53|INSA9_ECOLI Insertion element OS=Escherichia coli (strain K12) PE=3 SV=1 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA >sp|A0A385XJE6|INH21_ECOLI Transposase InsH for insertion sequence element OS=Escherichia coli (strain K12) PE=3 SV=1 MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA >sp|A0A385XJL4|INSB9_ECOLI Insertion element IS1 9 protein OS=Escherichia coli (strain K12) PE=3 SV=2 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF"

result1=re.findall(r'SV=\d\s([A-Z]+)', line)
result2=re.findall(r'>sp\|(\w+)\|', line)
result3=re.findall(r'>sp\|(\w+)\|\.\SV=\d\s([A-Z]+)', line)
for item1 in result1:
    print(item1)
for item2 in result2:
    print(item2)
for item3 in result3:
    print(item3)

Output of result1:

MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA
MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA
MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF

Output of result2:

A0A385XJ53
A0A385XJE6
A0A385XJL4

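However, result3 produces no output. I was under the impression that "." could be used in a regular expression to stand for a run of unspecified characters. What syntax can I use for a run of unspecified characters with no set length? I essentially want Python to look for a match to >sp\|(\w+)\| and continue until it finds SV=\d\s([A-Z]+). At that point, it should reset to looking for a match to >sp\|(\w+)\|. How can I do that? I would like it to output something like this: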

A0A385XJ53 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA
A0A385XJE6 MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA
A0A385XJL4 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF

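I have tried a few different things, and I think I may be misunderstanding how "." is used. Since my code has already converted all the proteins into a single string, I thought I could use "\b+" or "\b*" in its place, because there are no newlines. Both give me the following error.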

error                                     Traceback (most recent call last)
<ipython-input-76-f43b57fdde31> in <module>()
      8 result1=re.findall(r'SV=\d\s([A-Z]+)', line)
      9 result2=re.findall(r'>sp\|(\w+)\|', line)
---> 10 result3=re.findall(r'>sp\|(\w+)\|\b*\SV=\d\s([A-Z]+)', line)
     11 for item1 in result1:
     12     print(item1)

~\OneDrive\Documents\Python stuff\Pythonstuff\lib\re.py in findall(pattern, string, flags)
    220 
    221     Empty matches are included in the result."""
--> 222     return _compile(pattern, flags).findall(string)
    223 
    224 def finditer(pattern, string, flags=0):

~\OneDrive\Documents\Python stuff\Pythonstuff\lib\re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in parse(str, flags, pattern)
    853 
    854     try:
--> 855         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856     except Verbose:
    857         # the VERBOSE flag was switched on inside the pattern.  to be

~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
    614             if not item or (_len(item) == 1 and item[0][0] is AT):
    615                 raise source.error("nothing to repeat",
--> 616                                    source.tell() - here + len(this))
    617             if item[0][0] in _REPEATCODES:
    618                 raise source.error("multiple repeat",

error: nothing to repeat at position 14

Answer 1

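In the third pattern you escaped the S as \S, which means "match any non-whitespace character" rather than "match a literal S" (although it does happen to match the S itself).

When you escape the dot as \., it matches a literal dot, and there is no dot at that position in the example data.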
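The two points above can be checked in isolation. The snippet below is a small sketch, not part of the original answer; the sample strings and variable names are made up for illustration. It also shows why the `\b*` attempt from the question fails: `\b` is a zero-width word-boundary assertion, so the quantifier has nothing to repeat.

```python
import re

# \S is a character class: it matches ANY non-whitespace character, not only "S".
nonspace = re.findall(r'\S', 'S x')
print(nonspace)        # ['S', 'x']

# \. matches only a literal dot, so it finds nothing in text without one.
literal_dot = re.search(r'\.', 'PE=3 SV=1')
print(literal_dot)     # None

# An unescaped . matches any single character except a newline.
any_char = re.findall(r'.', 'a b')
print(any_char)        # ['a', ' ', 'b']

# \b consumes no characters, so applying * to it is rejected at compile time.
boundary_error = None
try:
    re.compile(r'\b*')
except re.error as exc:
    boundary_error = exc
print('re.error:', boundary_error)
```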

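From this part of the question: "I essentially want python to look for a match to >sp\|(\w+)\| and continue until it finds SV=\d\s([A-Z]+). At which point, it will reset to looking for >sp\|(\w+)\|'s match."

I think you want a non-greedy dot, .+?, to match whatever lies between the two patterns; the + makes it match at least one character.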

>sp\|(\w+)\|.+?SV=\d\s([A-Z]+)
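As a sketch of how this pattern behaves on the data from the question, the snippet below runs it with re.findall and, for comparison, runs a greedy variant (`.+` instead of `.+?`). The variable names `pairs` and `greedy` are illustrative only. With two capture groups, findall returns a list of (ID, sequence) tuples.

```python
import re

# The sample data from the question, split across literals for readability.
line = (
    ">sp|A0A385XJ53|INSA9_ECOLI Insertion element OS=Escherichia coli (strain K12) PE=3 SV=1 "
    "MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA "
    ">sp|A0A385XJE6|INH21_ECOLI Transposase InsH for insertion sequence element "
    "OS=Escherichia coli (strain K12) PE=3 SV=1 "
    "MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA "
    ">sp|A0A385XJL4|INSB9_ECOLI Insertion element IS1 9 protein "
    "OS=Escherichia coli (strain K12) PE=3 SV=2 "
    "MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF"
)

# Non-greedy .+? stops at the first "SV=<digit> " after each accession ID.
pairs = re.findall(r'>sp\|(\w+)\|.+?SV=\d\s([A-Z]+)', line)
for accession, sequence in pairs:
    print(accession, sequence)

# Greedy .+ runs to the end of the string and backtracks to the LAST "SV=",
# so it yields a single match: the first ID paired with the last sequence.
greedy = re.findall(r'>sp\|(\w+)\|.+SV=\d\s([A-Z]+)', line)
print(len(greedy), greedy[0][0], greedy[0][1][:5])
```

If the real data contains newlines inside an entry, `.` will not cross them unless the re.DOTALL flag is passed; that is not needed for the single-line string shown here.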

