finditer and findall jumping over substrings

Question

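I am trying to find all occurrences of p='gg' inside s='ggggg'. By my count there should be 4, since a match can start at any position except the last. For example, s[1:3] is 'gg'. However, trying both: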

>re.findall('gg','ggggg')
['gg','gg']
>list(re.finditer('gg','ggggg'))
[<_sre.SRE_Match object; span=(0, 2), match='gg'>,
 <_sre.SRE_Match object; span=(2, 4), match='gg'>]

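It seems that once a match is found, the potential matches that overlap it are skipped. As a consequence, searching for e.g. 'star' or 'start' is equivalent to searching for 'star' alone: I will never find the second, because the first is its prefix...

Is this a bug? How can I perform a full substring search?

Example 2: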

>re.findall('star|start','starting')
['star']
>list(re.finditer('star|start','starting'))
[<_sre.SRE_Match object; span=(0, 4), match='star'>]

(I am using Python 3, re version 2.2.1)

Answer 1

The keyword you are probably looking for is "overlapping". Here is a linked question: String count with overlapping occurrences.

From the re docs:

findall : Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

So this appears to be a feature, not a bug.

You can implement your own search function, for example:

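def my_search(s, substr):
    # Yield every index at which substr occurs in s, including overlapping ones.
    for i in range(len(s)):
        # Passing a start offset to startswith avoids building a new slice each time.
        if s.startswith(substr, i):
            yield i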
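
As a possible alternative that stays within the standard re module, a zero-width lookahead can report overlapping matches: the lookahead consumes no characters, so the scan retries at every position. This is a minimal sketch rather than anything taken from the re documentation, and the helper name overlapping_matches is hypothetical.

```python
import re

def overlapping_matches(pattern, s):
    """Return (start, text) for every match of pattern, overlapping ones included.

    The pattern is wrapped in a capturing group inside a lookahead, so each
    match is zero-width and the scanner moves on by a single position.
    """
    lookahead = re.compile('(?=(' + pattern + '))')
    return [(m.start(), m.group(1)) for m in lookahead.finditer(s)]

print(overlapping_matches('gg', 'ggggg'))
# [(0, 'gg'), (1, 'gg'), (2, 'gg'), (3, 'gg')]
```

The captured group carries the matched text, because the overall match of a lookahead is always the empty string.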

Answer 2

import re
re.findall('gg','ggggg')

The result is 2 matches, because re.findall does not look for overlapping matches, or, as the re docs put it:

Return all non-overlapping matches of pattern in string, as a list of strings.

So this is not a bug but documented behavior.

If you are allowed to use external modules, you can make use of regex as follows:

import regex  # third-party module: pip install regex
print(regex.findall('gg', 'ggggg', overlapped=True))

Output:

['gg', 'gg', 'gg', 'gg']
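
The 'star|start' example from the question is a slightly different case: both alternatives begin at the same position, and re tries alternatives left to right and keeps the first one that matches. As a hedged sketch using only the standard re module, either put the longer alternative first to prefer it, or search for each literal separately to collect both. The helper name find_each is hypothetical.

```python
import re

# Alternation is tried left to right, so the first alternative that matches wins.
print(re.findall('star|start', 'starting'))  # ['star']
print(re.findall('start|star', 'starting'))  # ['start']

def find_each(literals, s):
    """Search for each literal string on its own; return sorted (start, text) pairs."""
    hits = []
    for lit in literals:
        # The lookahead keeps matches zero-width, so overlapping hits are reported too.
        for m in re.finditer('(?=' + re.escape(lit) + ')', s):
            hits.append((m.start(), lit))
    return sorted(hits)

print(find_each(['star', 'start'], 'starting'))
# [(0, 'star'), (0, 'start')]
```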

python

python-re