What is the syntax for a run of characters of unspecified length when using findall in Python re?
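I am trying to use a regular expression to capture protein names and their corresponding amino acid sequences in some data I have. Here is a condensed version of my code: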
import re
line=">sp|A0A385XJ53|INSA9_ECOLI Insertion element OS=Escherichia coli (strain K12) PE=3 SV=1 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA >sp|A0A385XJE6|INH21_ECOLI Transposase InsH for insertion sequence element OS=Escherichia coli (strain K12) PE=3 SV=1 MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA >sp|A0A385XJL4|INSB9_ECOLI Insertion element IS1 9 protein OS=Escherichia coli (strain K12) PE=3 SV=2 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF"
result1=re.findall(r'SV=\d\s([A-Z]+)', line)
result2=re.findall(r'>sp\|(\w+)\|', line)
result3=re.findall(r'>sp\|(\w+)\|\.\SV=\d\s([A-Z]+)', line)
for item1 in result1:
    print(item1)
for item2 in result2:
    print(item2)
for item3 in result3:
    print(item3)
result1 outputs:
MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA
MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA
MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF
result2 outputs:
A0A385XJ53
A0A385XJE6
A0A385XJL4
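However, result3 produces no output. I was under the impression that "." could be used in a regular expression to stand for a run of unspecified characters. What syntax can I use for a run of unspecified characters with no set length? I essentially want python to look for a match to >sp\|(\w+)\| and continue until it finds SV=\d\s([A-Z]+). At which point, it will reset to looking for >sp\|(\w+)\|'s match. How can I do this? I would like it to output something like the following: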
A0A385XJ53 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA
A0A385XJE6 MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA
A0A385XJL4 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF
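I have tried a few different things, and I think I may not understand how "." is used. Since my code has already joined all of the proteins into a single string, I thought I could use "\b+" or "\b*" in its place, because there are no newlines. Both give me the following error.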
error Traceback (most recent call last)
<ipython-input-76-f43b57fdde31> in <module>()
8 result1=re.findall(r'SV=\d\s([A-Z]+)', line)
9 result2=re.findall(r'>sp\|(\w+)\|', line)
---> 10 result3=re.findall(r'>sp\|(\w+)\|\b*\SV=\d\s([A-Z]+)', line)
11 for item1 in result1:
12 print(item1)
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\re.py in _compile(pattern, flags)
299 if not sre_compile.isstring(pattern):
300 raise TypeError("first argument must be string or compiled pattern")
--> 301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
303 if len(_cache) >= _MAXCACHE:
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_compile.py in compile(p, flags)
560 if isstring(p):
561 pattern = p
--> 562 p = sre_parse.parse(p, flags)
563 else:
564 pattern = None
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in parse(str, flags, pattern)
853
854 try:
--> 855 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
856 except Verbose:
857 # the VERBOSE flag was switched on inside the pattern. to be
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
414 while True:
415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
417 if not sourcematch("|"):
418 break
~\OneDrive\Documents\Python stuff\Pythonstuff\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
614 if not item or (_len(item) == 1 and item[0][0] is AT):
615 raise source.error("nothing to repeat",
--> 616 source.tell() - here + len(this))
617 if item[0][0] in _REPEATCODES:
618 raise source.error("multiple repeat",
error: nothing to repeat at position 14
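In the third pattern, you escaped the S as \S, which matches any non-whitespace character rather than a literal S (although it does happen to match the S itself).
When you escape the dot as \., it matches a literal dot, which does not occur in the example data.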
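The difference between these escapes can be checked on a short string. The following is a minimal sketch; the `sample` string and the variable names are made up for illustration and are not part of the original code. It also shows why the `\b*` attempt from the question fails to compile: `\b` is a zero-width word-boundary assertion, so there is nothing for `*` to repeat.

```python
import re

# Hypothetical snippet of a record, for illustration only.
sample = "PE=3 SV=1 MASV"

# \S means "any non-whitespace character", so \SV matches "SV"...
escaped_s = re.findall(r'\SV=\d', sample)
# ...but it would equally match any other non-space character before "V".
escaped_s_other = re.findall(r'\SV=\d', "XV=1")

# \. is a literal dot; the sample contains none, so nothing matches.
literal_dot = re.findall(r'\.', sample)

# An unescaped dot matches any character except a newline.
any_char = re.findall(r'=.', sample)

# \b is a zero-width assertion, so quantifying it is a pattern error.
try:
    re.compile(r'\b*')
    boundary_error = None
except re.error as exc:
    boundary_error = str(exc)

print(escaped_s, escaped_s_other, literal_dot, any_char)
print(boundary_error)
```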
Reading this in the question: "I essentially want python to look for a match to >sp\|(\w+)\| and continue until it finds SV=\d\s([A-Z]+). At which point, it will reset to looking for >sp\|(\w+)\|'s match."
I think you want to use a non-greedy dot, .+?, to match whatever lies between the two patterns; it matches at least one character.
>sp\|(\w+)\|.+?SV=\d\s([A-Z]+)
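As a sketch of how that pattern might be used with the data from the question: with two capture groups, `re.findall` returns a list of tuples, one per record. The names `pattern`, `pairs`, `accession` and `sequence` are chosen here for illustration, and the string is split across lines only for readability.

```python
import re

# The same single-line data as in the question.
line = (
    ">sp|A0A385XJ53|INSA9_ECOLI Insertion element "
    "OS=Escherichia coli (strain K12) PE=3 SV=1 "
    "MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA "
    ">sp|A0A385XJE6|INH21_ECOLI Transposase InsH for insertion sequence element "
    "OS=Escherichia coli (strain K12) PE=3 SV=1 "
    "MFVIWSHRTGFIMSHQLTFADSEFSSKRRQTRKEIFLSRMEQILPWQNMVEVIEPFYPKA "
    ">sp|A0A385XJL4|INSB9_ECOLI Insertion element IS1 9 protein "
    "OS=Escherichia coli (strain K12) PE=3 SV=2 "
    "MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEMDEQWGYVGAKSRQRWLF"
)

# .+? is lazy: it consumes as little as possible, so it stops at the
# first "SV=<digit><whitespace>" that follows each ">sp|...|" header.
pattern = re.compile(r'>sp\|(\w+)\|.+?SV=\d\s([A-Z]+)')

# Two capture groups, so each match is an (accession, sequence) tuple.
pairs = pattern.findall(line)

for accession, sequence in pairs:
    print(accession, sequence)
```

The lazy quantifier matters here: a greedy `.+` would run to the last `SV=` in the string, so the first header would be paired with the final sequence and only one result would come back.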