How can I match a reStructuredText code block with Regex and Python?

Question

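I'm trying to extract code blocks from a .rst document using Python and regex. Code blocks in the document are defined by adding a .. code-block:: python directive to the text and then indenting the code by a few spaces.

Here is an example from my test document: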

.. code-block:: python

    import os
    from selenium import webdriver
    from axe_selenium_python import Axe

    def test_google():
        driver = webdriver.Firefox()
        driver.get("http://www.google.com")
        axe = Axe(driver)
        # Inject axe-core javascript into page.
        axe.inject()
        # Run axe accessibility checks.
        results = axe.execute()
        # Write results to file
        axe.write_results(results, 'a11y.json')
        driver.close()
        # Assert no violations are found
        assert len(results["violations"]) == 0, axe.report(results["violations"])
        driver.close()

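So far I have this regex: (\.\. code-block:: python\s\s)(.*\s.+).*?\n\s+(.*\s.+)+

The problem with this pattern is that it only selects the first and last parts of the test string. I need help writing a pattern that captures everything inside the .. code-block:: python block, excluding the .. code-block:: python directive itself.

You can see my progress with this here.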

Answer 1

If you insist on using regex, the example below should do the trick:

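import re

# `text` is assumed to hold the contents of the .rst document as a string.
pattern = r"(\.\. code-block:: python\s+$)((\n +.*|\s)+)"

matches = re.finditer(pattern, text, re.M)

for m, match in enumerate(matches):
    # Number groups from 1 so the labels line up with match.group(n).
    for g, group_text in enumerate(match.groups(), start=1):
        print("###match {}, group {}:###".format(m, g))
        print(group_text, end="")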

The trick, I think, is to use nested parentheses together with the MULTILINE (or M) flag.
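To illustrate why the MULTILINE flag matters here, the following minimal sketch runs only the header part of the pattern against a short, made-up sample string (`sample` and `header` are names chosen for illustration, not part of the original answer). Without the flag, `$` can only match at the very end of the string, so the directive line is never recognized.

```python
import re

# Hypothetical sample: a directive, a blank line, then one indented code line.
sample = ".. code-block:: python\n\n    import os\n"
header = r"\.\. code-block:: python\s+$"

# Without re.M, `$` matches only at the end of the string
# (or just before a final newline), so the search fails.
without_flag = re.search(header, sample)

# With re.M, `$` also matches at the end of every line.
with_flag = re.search(header, sample, re.M)

print(without_flag)
print(with_flag is not None)
```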

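Each resulting match object will have 3 groups, as defined by the parentheses:

Group 1: the '.. code-block:: python' header
Group 2: the contents of the code block
Group 3: a leftover from the extra grouping parentheses. It holds only the last repetition of the inner group (the final line or a trailing whitespace character) and can be ignored.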

To retrieve group n, use match.group(n). Note that group indices start at 1; passing 0 or no argument returns the entire matched string.
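Putting the pieces together, here is a minimal sketch of a helper that returns only the code, without the directive. It is a variation on the pattern above, not the answer's exact code: the inner group is made non-capturing with `(?:...)`, so only groups 1 and 2 exist, and `textwrap.dedent` strips the block's common indentation. The names `CODE_BLOCK`, `extract_python_blocks` and the sample text are hypothetical.

```python
import re
import textwrap

# Same idea as the pattern above, but the inner group is non-capturing,
# so each match has exactly two groups: the header and the block body.
CODE_BLOCK = re.compile(r"(\.\. code-block:: python\s+$)((?:\n +.*|\s)+)", re.M)


def extract_python_blocks(text):
    """Return the dedented body of every python code-block in `text`."""
    return [
        textwrap.dedent(m.group(2)).strip("\n")
        for m in CODE_BLOCK.finditer(text)
    ]


# Made-up sample document for illustration.
sample = (
    "Intro paragraph.\n"
    "\n"
    ".. code-block:: python\n"
    "\n"
    "    import os\n"
    "\n"
    "    def hello():\n"
    "        return os.getcwd()\n"
    "\n"
    "Closing paragraph.\n"
)

blocks = extract_python_blocks(sample)
print(blocks[0])
```

Blank lines inside the block are kept because the `\s` alternative consumes them, and the match stops at the first non-indented line of text.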
