Regex explanation of sphinx gallery

I am debugging sphinx gallery tooltip generation, which involves the following code:

def extract_intro_and_title(filename, docstring):
    """Extract and clean the first paragraph of module-level docstring."""
    # lstrip is just in case docstring has a '\n\n' at the beginning
    paragraphs = docstring.lstrip().split('\n\n')
    # remove comments and other syntax like `.. _link:`
    paragraphs = [p for p in paragraphs
                  if not p.startswith('.. ') and len(p) > 0]
    if len(paragraphs) == 0:
        raise ExtensionError(
            "Example docstring should have a header for the example title. "
            "Please check the example file:\n {}\n".format(filename))
    # Title is the first paragraph with any ReSTructuredText title chars
    # removed, i.e. lines that consist of (3 or more of the same) 7-bit
    # non-ASCII chars.
    # This conditional is not perfect but should hopefully be good enough.
    title_paragraph = paragraphs[0]
    match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                      re.MULTILINE)

    if match is None:
        raise ExtensionError(
            'Could not find a title in first paragraph:\n{}'.format(
                title_paragraph))
    title = match.group(0).strip()
    # Use the title if no other paragraphs are provided
    intro_paragraph = title if len(paragraphs) < 2 else paragraphs[1]
    # Concatenate all lines of the first paragraph and truncate at 95 chars
    intro = re.sub('\n', ' ', intro_paragraph)
    intro = _sanitize_rst(intro)
    if len(intro) > 95:
        intro = intro[:95] + '...'
    return intro, title

The line I don't understand is:

match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                  re.MULTILINE)

Can someone explain it to me?

To start:

>>> import re
>>> help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.
(END)

This tells us that re.search takes a pattern, a string, and optional flags that default to 0.

On its own, that is probably not much help.
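As a quick illustration of that signature, here is a minimal sketch of what re.search returns. The pattern and the sample strings are made up for the example and have nothing to do with sphinx gallery.

```python
import re

# re.search scans the whole string and returns the first match as a
# Match object, or None when nothing matches. flags defaults to 0.
found = re.search(r"\d+", "abc 123 def 456")
missing = re.search(r"\d+", "no digits here")

print(found.group(0))  # 123
print(missing)         # None
```

This is why the code in the question checks `match is None` before calling `match.group(0)`.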

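The flag passed is re.MULTILINE. This tells the regex engine to treat ^ and $ as the start and end of each line. By default they match only at the start and end of the whole string, no matter how many lines the string contains.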
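The effect of the flag is easy to see on a two-line string. The following is a small sketch with an invented sample text, showing where ^ is allowed to match with and without re.MULTILINE.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
default_hits = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches immediately after each newline.
multiline_hits = re.findall(r"^\w+", text, re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
```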

The pattern being matched looks for the following:

^ - the match must start at the beginning of a line

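(?!([\W _])\1{3,}) - the line must not begin with four or more copies of the same non-word character (\W), space ( ) or underscore (_). This is a negative lookahead ((?! ... )). Inside it, the parenthesised character class (([\W _])) matches one such character and stores it as capture group 1 (the explicit space is redundant, since \W already covers it). \1 refers back to whatever capture group 1 matched, and {3,} requires at least 3 repetitions of it. One captured character plus at least 3 repetitions means the lookahead rejects any line that starts with 4 or more identical characters of this kind, such as a ReST title underline like ==== or ----. The lookahead consumes no characters; it only asserts a condition at the current position.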

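As a side note, \W matches the opposite of \w, which for ASCII text is shorthand for [A-Za-z0-9_]. That makes \W shorthand for [^A-Za-z0-9_]. In Python 3, \w on a str pattern also matches Unicode word characters unless re.ASCII is set, so the bracket forms are exact only in ASCII mode.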

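(.+) - if the lookahead check at this position succeeds and the line contains 1 or more characters, the whole line is matched and stored in capture group 2.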
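Putting the pieces together, here is a small sketch of how the pattern picks a title out of a docstring paragraph. It assumes the backreference form ([\W _])\1{3,} described above, and the sample paragraph is hypothetical.

```python
import re

# Pattern as discussed above, assuming the backreference form with \1.
TITLE_RE = re.compile(r'^(?!([\W _])\1{3,})(.+)', re.MULTILINE)

# Hypothetical first paragraph of an example docstring:
# a title with ReST overline and underline.
title_paragraph = "=========\nMy title\n========="

match = TITLE_RE.search(title_paragraph)
title = match.group(0).strip()

print(title)           # My title
print(match.group(2))  # My title

# Only 3 identical characters: the lookahead needs 1 + at least 3,
# so this line is NOT rejected and is returned as the "title".
short_rule = TITLE_RE.search("---\nTitle").group(0)
print(short_rule)      # ---
```

The first line is skipped because it starts with four or more identical `=` characters, so the search moves on to the next line start. The last case shows the limitation hinted at by the "not perfect" comment in the source: a rule of exactly three characters slips through.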

See https://regex101.com/r/3p73lf/1 to explore the behaviour of the regex.