Regex explanation of sphinx gallery
I am debugging sphinx-gallery tooltip generation that involves the following code:
def extract_intro_and_title(filename, docstring):
    """Extract and clean the first paragraph of module-level docstring."""
    # lstrip is just in case docstring has a '\n\n' at the beginning
    paragraphs = docstring.lstrip().split('\n\n')
    # remove comments and other syntax like `.. _link:`
    paragraphs = [p for p in paragraphs
                  if not p.startswith('.. ') and len(p) > 0]
    if len(paragraphs) == 0:
        raise ExtensionError(
            "Example docstring should have a header for the example title. "
            "Please check the example file:\n {}\n".format(filename))
    # Title is the first paragraph with any ReSTructuredText title chars
    # removed, i.e. lines that consist of (3 or more of the same) 7-bit
    # non-ASCII chars.
    # This conditional is not perfect but should hopefully be good enough.
    title_paragraph = paragraphs[0]
    match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                      re.MULTILINE)
    if match is None:
        raise ExtensionError(
            'Could not find a title in first paragraph:\n{}'.format(
                title_paragraph))
    title = match.group(0).strip()
    # Use the title if no other paragraphs are provided
    intro_paragraph = title if len(paragraphs) < 2 else paragraphs[1]
    # Concatenate all lines of the first paragraph and truncate at 95 chars
    intro = re.sub('\n', ' ', intro_paragraph)
    intro = _sanitize_rst(intro)
    if len(intro) > 95:
        intro = intro[:95] + '...'
    return intro, title
The line I do not understand is:
    match = re.search(r'^(?!([\W _])\1{3,})(.+)', title_paragraph,
                      re.MULTILINE)
Can someone explain it to me?
To begin:
>>> import re
>>> help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.
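This tells us that re.search takes a pattern, a string, and optional flags that default to 0. On its own that is probably not much help.
The flag being passed is re.MULTILINE. It tells the regex engine to treat ^ and $ as the start and end of each line. By default they apply only to the start and end of the whole string, no matter how many lines the string contains.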
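The effect of re.MULTILINE on ^ can be seen with a small sketch. The sample string and variable names here are made up for illustration and are not taken from sphinx-gallery:

```python
import re

# Hypothetical two-line string, used only for this demonstration.
text = "first line\nsecond line"

# Without the flag, ^ anchors only at the very start of the string.
default_hits = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also anchors right after every newline.
multiline_hits = re.findall(r'^\w+', text, re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
```

This is why the title regex can skip a leading line and still match at the start of a later line in the same paragraph.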
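The pattern being matched looks for the following:
^
- The match must start at the beginning of a line.
(?!([\W _])\1{3,})
- The first four characters must not be the same character repeated, where that character is a non-word character (\W), a space ( ), or an underscore (_). This is a negative lookahead ((?! ... )). Inside it, the parenthesized character class ([\W _]) matches one character and stores it as capture group 1. \1 stands for the content of capture group 1, and {3,} means at least 3 times. One match plus 3 repetitions of that match means the line may not begin with 4 or more copies of the same such character. The lookahead consumes no characters; it only tests a position, and because it is negative, the test passes when the inner pattern does not match there.
As a side note, \W matches the opposite of \w, which is shorthand for [A-Za-z0-9_]. That makes \W shorthand for [^A-Za-z0-9_]. For Python 3 str patterns this equivalence is exact only with re.ASCII; by default \w also matches Unicode letters and digits. A space is already covered by \W, so within [\W _] only the underscore actually extends the class.
(.+)
- If the position test above passes and the line contains 1 or more characters, the whole line is captured in group 2.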
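The side note on \W can be checked directly. This is a minimal sketch with a made-up sample string; it shows that the [^A-Za-z0-9_] reading holds exactly under re.ASCII, while the default for str patterns is Unicode-aware:

```python
import re

# Hypothetical sample: ASCII letters, an underscore, a space,
# an accented letter (U+00E9), and two punctuation characters.
sample = "a_b \u00e9-!"

# Default str patterns: the accented letter counts as a word character.
unicode_nonword = re.findall(r'\W', sample)

# With re.ASCII, \w is exactly [A-Za-z0-9_], so the accented letter is non-word.
ascii_nonword = re.findall(r'\W', sample, re.ASCII)

print(unicode_nonword)  # [' ', '-', '!']
print(ascii_nonword)    # [' ', 'é', '-', '!']
```

In both modes the underscore is a word character, which is why the class in the title regex has to list it explicitly.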
You can explore the regex's behavior at https://regex101.com/r/3p73lf/1.
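The same behavior can be tried locally. The sketch below assumes the pattern carries the \1 backreference that the explanation describes; TITLE_RE, first_title_line, and the sample paragraphs are hypothetical names and data, not part of sphinx-gallery:

```python
import re

# Pattern from the question, assumed to include the \1 backreference.
TITLE_RE = re.compile(r'^(?!([\W _])\1{3,})(.+)', re.MULTILINE)


def first_title_line(paragraph):
    """Return the first line that is not a run of 4+ identical rule chars."""
    match = TITLE_RE.search(paragraph)
    return None if match is None else match.group(0).strip()


# An overlined reST title: both rule lines are skipped, the text line wins.
print(first_title_line("==========\nMy example\n=========="))  # My example

# A paragraph made only of a rule line has no usable title.
print(first_title_line("----------"))  # None

# Only three repeated characters: the lookahead needs four, so this line passes.
print(first_title_line("===\nShort rule"))  # ===
```

The last case shows why the source comment calls the conditional "not perfect": a very short underline is still accepted as a title.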