Check if a position in string is within a pair of certain characters

In Python, what is the most efficient way to determine whether a position in a string lies within a pair of certain character sequences?

       0--------------16-------------------37---------48--------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

In the string cost, I want to figure out whether a certain position is enclosed by a pair of $ or lies within \( \).

For the string cost, a function is_maths(cost, x) would return True for x in [37, 38, 39, 48] and evaluate to False everywhere else.
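To double-check the indices marked in the diagram, a quick sketch like the following can help. It assumes the dollar sign before the 1 is escaped as \$ (so that index 16 falls on the 1); the name marked is made up for this illustration.

```python
# Assumption: the dollar before "1" is escaped (\$), so index 16 is the "1".
cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

# Characters sitting under the markers 0, 16, 37, 48 and 57 of the diagram.
marked = {i: cost[i] for i in (0, 16, 37, 48, 57)}
print(marked)

# The text between the pair of $ and the text between \( and \).
print(cost[37:40])
print(cost[48:49])
```

Indices 37 to 39 cover x^2 and index 48 is the x inside \(...\), which is where the expected list [37, 38, 39, 48] comes from.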

The motivation is to figure out the valid LaTeX maths positions; alternative efficient approaches in Python are welcome too.

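You'll need to parse the string up to the requested position and, if that position is inside a pair of valid LaTeX environment delimiters, up to the closing delimiter, before you can answer True or False. That's because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.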

As I understand it, LaTeX's $...$ and \(...\) environment delimiters can't be nested, so you don't have to worry about nesting here; you only need to find the nearest complete $...$ or \(...\) pair.

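You can't just match on literal $, \( or \) characters, however, because each of these could be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, then iterate over the tokens in order and track what was matched last to determine their effect (escaping the next character, and opening and closing maths environments).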
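The tokenizing step on its own can be sketched roughly as follows. This is only an illustration of splitting on the metacharacters; TOKENS and tokens are hypothetical names, not part of any library.

```python
import re

# Match a single backslash, dollar sign or parenthesis.
TOKENS = re.compile(r'[\\$()]')


def tokens(s):
    """Return (character, index) pairs for every metacharacter in s."""
    return [(m.group(), m.start()) for m in TOKENS.finditer(s)]


# Each backslash is its own token; whether it escapes the next token
# is decided later, by looking at whether the two are adjacent.
print(tokens(r'\$1 and \(x\)'))
```

Keeping the index with each token is what lets a parser tell a backslash that sits directly before a $ or parenthesis apart from one that escaped some other character.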

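There is no need to continue parsing once you are past the requested position and outside of a maths environment; you already have your answer and can return False early.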

Here is my implementation of such a parser:

import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')

def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos

def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
        elif (token, escaped) == expected_closer:
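            # an escaped closer such as \) starts at its backslash, which
            # is not itself inside the environment.
            closer_start = token_pos - 1 if escaped else token_pos
            if opener_pos < pos < closer_start: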
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

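        # a backslash only escapes the character directly after it, so
        # clear the flag once a non-backslash token has been handled.
        if token != '\\':
            escaped = False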

    return False

Demo:

>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0)  # should be False
False
>>> is_maths(cost, 16)  # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37)  # should be True, within $...$
True
>>> is_maths(cost, 48)  # should be True, within \(...\)
True
>>> is_maths(cost, 57)  # should be False, within unescaped (...)
False

And additional tests to show that escapes are handled correctly:

>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27)  # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27)  # no longer escaped, so false
False
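The parser answers one position per call. If every maths position of a string is wanted at once, for example to reproduce the expected [37, 38, 39, 48] from the question, one rough alternative is to collect all spans in a single pass. The sketch below is not the parser shown here; it swaps in a regex that consumes a backslash together with the character after it, and UNITS, PAIRS, maths_spans and maths_positions are hypothetical names.

```python
import re

# A backslash plus the character after it forms one unit, so a doubled
# backslash is consumed as a pair and cannot escape what follows it.
# A bare "$" is the only other unit of interest.
UNITS = re.compile(r'\\.|\$', re.DOTALL)
PAIRS = {'$': '$', r'\(': r'\)'}  # opener -> closer


def maths_spans(s):
    """Return (start, end) pairs; indices start <= i < end are maths."""
    spans = []
    closer = start = None
    for m in UNITS.finditer(s):
        unit = m.group()
        if closer is None:
            if unit in PAIRS:
                closer, start = PAIRS[unit], m.end()
        elif unit == closer:
            spans.append((start, m.start()))
            closer = None
    return spans  # an environment left unclosed contributes no span


def maths_positions(s):
    """Flatten the spans into a list of individual indices."""
    return [i for start, end in maths_spans(s) for i in range(start, end)]


cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
print(maths_positions(cost))
```

Like the parser, this sketch silently skips delimiters that do not fit the environment currently open.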

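My implementation deliberately ignores malformed LaTeX: unescaped $ characters within \(...\), or escaped \( and \) characters within $...$, are ignored, as are further \( openers within a \(...\) sequence and \) closers without a matching \( opener. This ensures that the function keeps working even when given input that LaTeX itself would not render. The parser could be altered to raise an exception or return False in those cases, however. In that case, you'd need to add a global set created from _maths_pairs.keys() | _maths_pairs.values() and test (token, escaped) against that set when expected_closer is not None and (token, escaped) != expected_closer is true (to detect nested environment delimiters), and test for token == ')' and escaped and expected_closer is None to detect a \) closer without an opener.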
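A strict variant along those lines might look roughly like this. It is a sketch under assumptions rather than a drop-in replacement: PAIRS, DELIMITERS, TOKENS, MalformedMaths and is_maths_strict are names invented here, a backslash is taken to escape only the character directly after it, the backslash of \) is treated as outside the environment, and malformed input that appears after the answer is already known is never inspected.

```python
import re

# All names below are made up for this sketch.
PAIRS = {('$', False): ('$', False), ('(', True): (')', True)}
DELIMITERS = set(PAIRS) | set(PAIRS.values())
TOKENS = re.compile(r'[\\$()]')


class MalformedMaths(ValueError):
    """Nested or unbalanced maths delimiters were found."""


def is_maths_strict(s, pos):
    """Like is_maths, but raise MalformedMaths instead of ignoring errors."""
    expected_closer = None  # (char, escaped) while inside an environment
    opener_pos = None
    escaped = False
    prev_pos = -1

    for match in TOKENS.finditer(s):
        token, token_pos = match.group(), match.start()
        if expected_closer is None and token_pos > pos:
            return False  # past pos and outside any environment

        # text between the last backslash and this token uses up the escape
        if escaped and token_pos > prev_pos + 1:
            escaped = False
        prev_pos = token_pos

        if token == '\\':
            escaped = not escaped  # doubled escapes negate
            continue

        key = (token, escaped)
        escaped = False
        if key == expected_closer:
            # an escaped closer such as \) starts at its backslash
            closer_start = token_pos - 1 if key[1] else token_pos
            if opener_pos < pos < closer_start:
                return True
            expected_closer = None
        elif expected_closer is not None and key in DELIMITERS:
            raise MalformedMaths(f'nested delimiter at index {token_pos}')
        elif key == (')', True):
            raise MalformedMaths(f'unmatched closer at index {token_pos}')
        elif key in PAIRS:
            expected_closer = PAIRS[key]
            opener_pos = token_pos

    return False
```

For example, is_maths_strict(r'\(a $b$ c\)', 8) raises MalformedMaths because of the unescaped $ inside \(...\), while well-formed input such as the cost string gives the same answers as the demo.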