Regex matching "|" separated values for Union types

Question

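I'm trying to match type annotations like int | str, and use a regex substitution to replace them with the string Union[int, str].

Desired substitutions (before and after):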

str|int|bool -> Union[str,int,bool]
Optional[int|tuple[str|int]] -> Optional[Union[int,tuple[Union[str,int]]]]
dict[str | int, list[B | C | Optional[D]]] -> dict[Union[str,int], list[Union[B,C,Optional[D]]]]

The regular expression I've come up with so far is as follows:

r"\w*(?:\[|,|^)[\t ]*((?'type'[a-zA-Z0-9_.\[\]]+)(?:[\t ]*\|[\t ]*(?&type))+)(?:\]|,|$)"

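You can try it out here on Regex Demo. It doesn't really work the way I want it to. The problems I've noticed so far:

So far it doesn't seem to handle nested Union conditions. For example, int | tuple[str|int] | bool seems to result in one match, rather than two matches (including the inner Union condition).
The regex seems to consume an unnecessary ] at the end.
Probably the most important one: I noticed that the re module in Python doesn't seem to support regex subroutines. Here is where I got the idea to use them.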
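
The last point can be checked directly. The sketch below tries to compile a few patterns with the standard-library re module; the helper name compiles_with_re and the shortened sample patterns are illustrative assumptions, and only question_pattern is copied from the question.

```python
import re


def compiles_with_re(pattern: str) -> bool:
    """Return True if the standard-library `re` module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Ordinary syntax: words separated by pipes and optional whitespace.
flat = r"\w+(?:\s*\|\s*\w+)+"

# PCRE-style named subroutine call, (?&type), in a shortened pattern.
named_subroutine = r"(?P<type>\w+)(?:\|(?&type))+"

# PCRE-style numbered recursion into group 1, (?1), for balanced brackets.
numbered_recursion = r"(\[(?:[^\]\[|]+|(?1))*\])"

# The pattern from the question, which uses (?'type'...) and (?&type).
question_pattern = (
    r"\w*(?:\[|,|^)[\t ]*((?'type'[a-zA-Z0-9_.\[\]]+)"
    r"(?:[\t ]*\|[\t ]*(?&type))+)(?:\]|,|$)"
)

for name, pattern in [
    ("flat", flat),
    ("named_subroutine", named_subroutine),
    ("numbered_recursion", numbered_recursion),
    ("question_pattern", question_pattern),
]:
    print(name, compiles_with_re(pattern))
```

Only the first pattern compiles; the other three are rejected because re has no subroutine or recursion syntax.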

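Additional Info

This is mainly to support the PEP 604 syntax on Python 3.7+. That requires the annotations to be forward-declared (e.g. declared as strings), because otherwise the builtin types don't support the | operator.

Here is the sample code I came up with: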

from __future__ import annotations

import datetime
from decimal import Decimal
from typing import Optional


class A:
    field_1: str|int|bool
    field_2: int  |  tuple[str|int]  |  bool
    field_3: Decimal|datetime.date|str
    field_4: str|Optional[int]
    field_5: Optional[int|str]
    field_6: dict[str | int, list[B | C | Optional[D]]]

class B: ...
class C: ...
class D: ...

For Python versions earlier than 3.10, I use a __future__ import to avoid the error below:

TypeError: unsupported operand type(s) for |: 'type' and 'type'

This essentially converts all annotations to strings, as shown below:

>>> A.__annotations__
{'field_1': 'str | int | bool', 'field_2': 'int | tuple[str | int] | bool', 'field_3': 'Decimal | datetime.date | str', 'field_4': 'str | Optional[int]', 'field_5': 'Optional[int | str]', 'field_6': 'dict[str | int, list[B | C | Optional[D]]]'}

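But in code (say, in another module), I want to evaluate the annotations in A. This works in Python 3.10, but fails on earlier versions (3.7 to 3.9), even though the __future__ import supports forward-declared annotations.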

>>> from typing import get_type_hints
>>> hints = get_type_hints(A)

Traceback (most recent call last):
    eval(self.__forward_code__, globalns, localns),
  File "<string>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'type' and 'type'

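It seems the best way to make this work is to replace all occurrences of int | str (for example) with Union[int, str]. Then, with typing.Union included in the additional localns used to evaluate the annotations, it should be possible to evaluate PEP 604 style annotations on Python 3.7+.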
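
As a minimal sketch of that last step, assume the annotation strings have already been rewritten by hand (the rewriting itself is what the question asks about). The field names come from class A above; the namespace dict is an illustrative name, and globalns is passed explicitly only to keep the example self-contained.

```python
import typing
from typing import get_type_hints


class A:
    # String annotations, already rewritten by hand from
    # 'str | int | bool' and 'Optional[int | str]'.
    field_1: 'Union[str, int, bool]'
    field_5: 'Optional[Union[int, str]]'


# Union and Optional are deliberately NOT imported at module level here,
# so they can only be resolved through the extra local namespace.
namespace = {'Union': typing.Union, 'Optional': typing.Optional}

hints = get_type_hints(A, globalns=globals(), localns=namespace)
print(hints)
```

Since no | expression remains in the strings, evaluating them does not depend on the Python 3.10 operator support.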

Answer 1

You can install the PyPI regex module (since re does not support recursion) and use

import regex
text = "str|int|bool\nOptional[int|tuple[str|int]]\ndict[str | int, list[B | C | Optional[D]]]"
rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
n = 1
res = text
while n != 0:
    res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))), res) 

print( regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res) )

Output:

Union[str,int,bool]
Optional[Union[int,tuple[Union[str,int]]]]
dict[Union[str,int], list[Union[B,C,Optional[D]]]]

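See the Python demo.

The first regex finds all kinds of WORD[...] that contain pipe chars and other WORDs or WORD[...]s with no pipe chars inside them.

The \w+(?:\s*\|\s*\w+)+ regex matches 2 or more words separated by pipes and optional whitespace.

Details of the first pattern:

(\w+\[) - Group 1 (kept as is at the start of the replacement): one or more word chars and then a [ char
(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+) - Group 2 (it is put inside Union[...], with every \s*\|\s* match replaced with ,):
- \w+ - one or more word chars
- (\[(?:[^][|]++|(?3))*])? - an optional Group 3 that matches a [ char, followed by zero or more occurrences of either one or more chars other than [, ] and |, or the whole Group 3 pattern recursed (hence, it matches nested brackets), and then a ] char
- (?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+ - one or more occurrences (so the match contains at least one pipe char to replace with ,) of:
  - \s*\|\s* - a pipe char enclosed with zero or more whitespaces
  - \w+ - one or more word chars
  - (\[(?:[^][|]++|(?4))*])? - an optional Group 4 (matches the same thing as Group 3; note that the (?4) subroutine repeats the Group 4 pattern)
] - a ] char.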
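
The second, simpler pattern uses no recursion, so it can be tried with the standard-library re module alone. The sketch below covers only that final pass, not the recursive first pass; the names PIPE, FLAT_UNION and flat_unions are illustrative.

```python
import re

PIPE = re.compile(r"\s*\|\s*")
FLAT_UNION = re.compile(r"\w+(?:\s*\|\s*\w+)+")


def flat_unions(text: str) -> str:
    """Wrap each run of pipe-separated words in Union[...]."""
    return FLAT_UNION.sub(
        lambda m: "Union[{}]".format(PIPE.sub(",", m.group())), text)


print(flat_unions("str|int|bool"))               # Union[str,int,bool]
print(flat_unions("Optional[int | str]"))        # Optional[Union[int,str]]
print(flat_unions("Decimal|datetime.date|str"))  # Union[Decimal,datetime].Union[date,str]
```

The last line shows a limit of this pass taken on its own: \w does not match ".", so a dotted name such as datetime.date is split in two.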

Answer 2

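Just an update: I was finally able to come up with a (fully working) non-regex approach to this problem. It took me this long because it required some serious thought. It really wasn't easy to do; it took me two days of on-and-off work to piece everything together and to wrap my head around what I was trying to accomplish.

The solution proposed by @Wiktor is the currently accepted answer, and it works great overall. Going back to it later, I found only a few edge cases it couldn't handle, which I discussed here. However, I had to wonder whether a non-regex solution might be the better option, for a few reasons:

My actual use case is a library (package) I'm building, so I want to keep dependencies to a minimum. Unfortunately the regex module is an external dependency, and its size is not negligible; in my case, I'd probably need to add this dependency as an extra feature of my library.
The regex matching doesn't seem to be as fast or efficient as I'd like. Don't get me wrong, it still matches the complex use cases mentioned in the post very quickly (about 1-3 milliseconds on average), but I can see how this would add up quickly if a class has a lot of annotations. So I suspected a non-regex approach would almost certainly be faster, and was curious to test that.

Therefore, I'm posting the non-regex implementation I was able to cobble together below. It solves my original problem of converting union type annotations like X|Y to annotations like Union[X, Y], and goes beyond that to support more complex use cases that I found the regex implementation didn't account for. I still like the regex version because I think it's much simpler than this one, and I believe it would work without problems in most cases.

Note, however, that this is the first and only non-regex implementation I've been able to put together for this particular problem. Without further ado, here it is: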

from typing import Iterable, Dict, List


# Constants
OPEN_BRACKET = '['
CLOSE_BRACKET = ']'
COMMA = ','
OR = '|'


def repl_or_with_union(s: str):
    """
    Replace all occurrences of PEP 604-style annotations (i.e. like `X | Y`)
    with the Union type from the `typing` module, i.e. like `Union[X, Y]`.

    This is a recursive function that splits a complex annotation in order to
    traverse and parse it, i.e. one that is declared as follows:

      dict[str | Optional[int], list[list[str] | tuple[int | bool] | None]]
    """
    return _repl_or_with_union_inner(s.replace(' ', ''))


def _repl_or_with_union_inner(s: str):

    # If there is no '|' character in the annotation part, we just return it.
    if OR not in s:
        return s

    # Checking for brackets like `List[int | str]`.
    if OPEN_BRACKET in s:

        # Get any indices of COMMA or OR outside a braced expression.
        indices = _outer_comma_and_pipe_indices(s)

        outer_commas = indices[COMMA]
        outer_pipes = indices[OR]

        # We need to check if there are any commas *outside* a bracketed
        # expression. For example, the following cases are what we're looking
        # for here:
        #     value[test], dict[str | int, tuple[bool, str]]
        #     dict[str | int, str], value[test]
        # But we want to ignore cases like these, where all commas are nested
        # within a bracketed expression:
        #     dict[str | int, Union[int, str]]
        if outer_commas:
            return COMMA.join(
                [_repl_or_with_union_inner(i)
                 for i in _sub_strings(s, outer_commas)])

        # We need to check if there are any pipes *outside* a bracketed
        # expression. For example:
        #     value | dict[str | int, list[int | str]]
        #     dict[str, tuple[int | str]] | value
        # But we want to ignore cases like these, where all pipes are
        # nested within a bracketed expression:
        #     dict[str | int, list[int | str]]
        if outer_pipes:
            or_parts = [_repl_or_with_union_inner(i)
                        for i in _sub_strings(s, outer_pipes)]

            return f'Union{OPEN_BRACKET}{COMMA.join(or_parts)}{CLOSE_BRACKET}'

        # At this point, we know that the annotation does not have an outer
        # COMMA or PIPE expression. We also know that the following syntax
        # is invalid: `SomeType[str][bool]`. Therefore, knowing this, we can
        # assume there is only one outer start and end brace. For example,
        # like `SomeType[str | int, list[dict[str, int | bool]]]`.

        first_start_bracket = s.index(OPEN_BRACKET)
        last_end_bracket = s.rindex(CLOSE_BRACKET)

        # Replace the value enclosed in the outermost brackets
        bracketed_val = _repl_or_with_union_inner(
            s[first_start_bracket + 1:last_end_bracket])

        start_val = s[:first_start_bracket]
        end_val = s[last_end_bracket + 1:]

        return f'{start_val}{OPEN_BRACKET}{bracketed_val}{CLOSE_BRACKET}{end_val}'

    elif COMMA in s:
        # We are dealing with a string like `int | str, float | None`
        return COMMA.join([_repl_or_with_union_inner(i)
                           for i in s.split(COMMA)])

    # We are dealing with a string like `int | str`
    return f'Union{OPEN_BRACKET}{s.replace(OR, COMMA)}{CLOSE_BRACKET}'


def _sub_strings(s: str, split_indices: Iterable[int]):
    """Split a string on the specified indices, and return the split parts."""
    prev = -1

    for idx in split_indices:
        yield s[prev+1:idx]
        prev = idx

    yield s[prev+1:]


def _outer_comma_and_pipe_indices(s: str) -> Dict[str, List[int]]:
    """Return any indices of ',' and '|' that are outside of braces."""
    indices = {OR: [], COMMA: []}
    brace_dict = {OPEN_BRACKET: 1, CLOSE_BRACKET: -1}
    brace_count = 0

    for i, char in enumerate(s):
        if char in brace_dict:
            brace_count += brace_dict[char]
        elif not brace_count and char in indices:
            indices[char].append(i)

    return indices
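
The idea the helpers above rely on is bracket-depth counting: a separator only counts when it sits outside every [...] pair. Below is a standalone sketch of just that idea. The function split_top_level is a hypothetical simplification for illustration and is not part of the implementation above.

```python
from typing import List


def split_top_level(s: str, sep: str) -> List[str]:
    """Split `s` on `sep`, ignoring separators nested inside [...]."""
    parts = []
    depth = 0   # current bracket nesting level
    start = 0   # index where the current part begins

    for i, char in enumerate(s):
        if char == '[':
            depth += 1
        elif char == ']':
            depth -= 1
        elif char == sep and depth == 0:
            # Separator outside all brackets: close the current part.
            parts.append(s[start:i])
            start = i + 1

    parts.append(s[start:])
    return parts


print(split_top_level('str|Optional[int|None]|list[A|B]', '|'))
# ['str', 'Optional[int|None]', 'list[A|B]']
print(split_top_level('dict[str|int,str],value[test]', ','))
# ['dict[str|int,str]', 'value[test]']
```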

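I've tested it against the common use cases listed in the question above, as well as more complex use cases that even the regex implementation seemed to struggle with.

For example, given these sample test cases: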

test_cases = """
str|int|bool
Optional[int|tuple[str|int]]
dict[str | int, list[B | C | Optional[D]]]
dict[str | Optional[int], list[list[str] | tuple[int | bool] | None]]
tuple[str|OtherType[a,b|c,d], ...] | SomeType[str | int, list[dict[str, int | bool]]] | dict[str | int, str]
"""

for line in test_cases.strip().split('\n'):
    print(repl_or_with_union(line).replace(',', ', '))

Then the result is as follows (note that I've replaced each "," with ", " so it's a little easier to read):

Union[str, int, bool]
Optional[Union[int, tuple[Union[str, int]]]]
dict[Union[str, int], list[Union[B, C, Optional[D]]]]
dict[Union[str, Optional[int]], list[Union[list[str], tuple[Union[int, bool]], None]]]
Union[tuple[Union[str, OtherType[a, Union[b, c], d]], ...], SomeType[Union[str, int], list[dict[str, Union[int, bool]]]], dict[Union[str, int], str]]

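Now, the only cases the regex implementation can't parse correctly are the last two, which are arguably rather complex to begin with. Here is the regex solution's output for those two, which unfortunately isn't what we want (again, I've made sure there is a space after each comma so it's easier to read):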

dict[Union[str, Optional][int],  list[Union[list[str], tuple[Union[int, bool]], None]]]
tuple[Union[str, OtherType][a, Union[b, c], d],  ...] | SomeType[Union[str, int],  list[dict[str,  Union[int, bool]]]] | dict[Union[str, int],  str]

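Perhaps it's worth looking at why these cases aren't handled as expected by the regex version. My suspicion was that any value in a | expression that contains square brackets [] doesn't get parsed correctly, and testing confirmed it. For example, str | Optional[int] currently gets parsed as Union[str,Optional][int], but ideally it would be handled as Union[str,Optional[int]].

I boiled the two test cases above down to the abbreviated forms below, for which I was able to confirm that the regex doesn't handle them as expected: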

str | Optional[int]
tuple[str|OtherType[a,b|c,d], ...] | SomeType[str]

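These are the current results when parsed with the regex implementation. Note that in one of the results a | character is still present; ideally it would be removed, since Python versions earlier than 3.10 can't evaluate pipe | expressions on builtin types.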

Union[str,Optional][int]
tuple[Union[str,OtherType][a,Union[b,c],d], ...] | SomeType[str]

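The desired end result (which the non-regex approach appears to produce as expected, after I fixed it to handle such cases during testing) is as follows: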

Union[str, Optional[int]]
Union[tuple[Union[str,OtherType[a,Union[b,c],d]], ...], SomeType[str]]

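Lastly, I was also able to time it against the regex approach above. I was curious myself how this solution compares to the regex version, which is arguably simpler and easier to understand.

The code I tested with is below: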

import regex
from timeit import timeit


def regex_repl_or_with_union(text):
    rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
    n = 1
    res = text
    while n != 0:
        res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))),
                            res)

    return regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res)

test_cases = """
str|int|bool
Optional[int|tuple[str|int]]
dict[str | int, list[B | C | Optional[D]]]
"""


def non_regex_solution():
    for line in test_cases.strip().split('\n'):
        _ = repl_or_with_union(line)


def regex_solution():
    for line in test_cases.strip().split('\n'):
        _ = regex_repl_or_with_union(line)

n = 100_000
print('Non-regex: ', timeit('non_regex_solution()', globals=globals(), number=n))
print('Regex:     ', timeit('regex_solution()', globals=globals(), number=n))

Results, run on an Alienware PC with an AMD Ryzen 7 3700X 8-core processor and 16GB of memory:

Non-regex:  2.0510589000186883
Regex:      31.39290289999917

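So the non-regex implementation I came up with is actually about 15x faster on average than the regex implementation, which is incredible. The best news for me is that it involves no additional dependencies. I'll probably go with the non-regex solution for now, mainly because I want to keep project dependencies to a minimum. Thanks again to @Wiktor and everyone who helped with this question and guided me towards a solution!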
