Regex dealing with brackets

Question

I have multiple strings such as

string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string2 = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]""" 
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""
strings = [string1, string2, string3]

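Each string contains one or more "[br]"s.

Each string may or may not include annotations.

Every annotation starts with "[*" and ends with "]". It may contain double brackets ("[[" and "]]") but never a single bracket ("[" or "]"), so there is no ambiguity about where it ends (for example, [* some annotation with [[brackets]]]).

The words I want to replace are those between the first "[br]" and the annotations (if there are any, otherwise the end of the string), which are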

word1 = """팔짱낄 공''':'''"""
word2 = """낟알 과'''-'''"""
word3 = """둘레 곽[br]클 확"""

So I tried

import re

for string in strings:
    print(re.sub(r"\[br\](.)+?(\[\*)+", "AAAA", string))

expecting

[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]

The logic of the regex is

\[br\] : the first "[br]"

(.)+? : one or more characters that I want to replace, lazy

(\[\*)+ : one or more "[*"s

But the result was

[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[顆|{{{#!html}}}]]AAAA some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]]AAAA another annotation.][* another annotation.]

instead. I also tried r"\[br\](.)+?(\[\*)*", but that did not work either. How can I fix this?

Answer 1

The best I could come up with is to first check whether there are any annotations:

import re
r = re.compile(r'''
    (\[br])      
    (.*?)
    (\[\*.*\]$)
''', re.VERBOSE)

annotation = re.compile(r'''
    (\[\*.*]$)
''', re.VERBOSE)

def replace(m):
    return m.group(1) + "AAAA" + m.group(3)

for s in string1, string2, string3:
    print()
    print(s)
    if annotation.search(s):
        print(r.sub(replace, s))
    else:
        print(re.sub(r'\[br](.*)', '[br]AAAA', s))

which gives the expected output:

[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[拱|{{{#!html}}}]][br]AAAA

[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]

[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]

I suppose you could move the if into the replace function, but I'm not sure that would be much of an improvement. It would look like:

import re
r = re.compile(r'''
    ^(?P<prefix>.*)
    (?P<br>\[br].*?)
    (?P<annotation>\[\*.*\])?
    (?P<rest>[^\[]*)$
''', re.VERBOSE)

def replace(m):
    g = m.groupdict()
    # the greedy prefix contains everything up to the last [br],
    # so keep only the part before the first [br]
    head = g['prefix'].split('[br]')[0]
    if g['annotation'] is None:
        # no annotation: 'rest' holds the tail of the text being replaced, so drop it
        return head + "[br]AAAA"
    return head + "[br]AAAA" + g['annotation'] + g['rest']

for s in string1, string2, string3:
    print()
    print(s)
    print(r.sub(replace, s))
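
To see why the split on '[br]' is needed, it can help to print what each named group actually captures. The snippet below is only an illustrative sketch: it repeats the pattern so that it runs on its own and reuses two of the sample strings from the question. The names groups1 and groups3 are introduced here for illustration.

```python
import re

# Same named-group pattern as above, repeated so this snippet is self-contained.
r = re.compile(r'''
    ^(?P<prefix>.*)
    (?P<br>\[br].*?)
    (?P<annotation>\[\*.*\])?
    (?P<rest>[^\[]*)$
''', re.VERBOSE)

string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""

groups1 = r.match(string1).groupdict()
groups3 = r.match(string3).groupdict()

# string3 has two [br]s: the greedy prefix swallows the first one.
for name, value in groups3.items():
    print(f"{name:>10}: {value!r}")

# string1 has no annotation: that group is None and 'rest' holds the word itself.
for name, value in groups1.items():
    print(f"{name:>10}: {value!r}")
```

Because prefix is greedy, it runs up to the last [br], and for a string without an annotation the text after that [br] lands in rest rather than in br.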

Answer 2

You can use

^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)

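The pattern matches

^ Start of string (start of each line when re.MULTILINE is used)
(.*?\[br]) Capture group 1: match as few characters as possible up to and including the first occurrence of [br]
.+? Match any character 1+ times, as few as possible
(?= Positive lookahead, assert that what is on the right is
- \[\*.*?](?<!].)(?!]) Match [* up to a ] that is neither preceded nor followed by another ]
- | Or
- $ Assert the end of the string
) Close the lookahead

In the replacement, use capture group 1 followed by AAAA, as in \1AAAA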
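
The part of the lookahead that finds the closing ] of an annotation can be tried on its own. The following is a small sketch, using a made-up sample text, that compares it with a naive \[\*.*?] which stops at the first ] it sees. The names annotation, naive and text are introduced here only for illustration.

```python
import re

# Only the annotation part of the lookahead from the pattern above.
annotation = re.compile(r"\[\*.*?](?<!].)(?!])")
# Naive variant for comparison: stops at the first "]" it meets.
naive = re.compile(r"\[\*.*?]")

# Hypothetical sample: the first annotation contains "[[" and "]]".
text = 'word[* note with "[[" and "]]" inside.][* second note.]'

found = annotation.findall(text)
found_naive = naive.findall(text)

print(found)        # each annotation is matched up to its real closing "]"
print(found_naive)  # the first annotation is cut off inside "]]"
```

The lookbehind (?<!].) rejects a ] whose previous character is also ], and the lookahead (?!]) rejects a ] that is followed by another ], so neither half of "]]" is taken as the end of the annotation.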


Example code

import re

pattern = r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)"

s = ("[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''\n"
            "[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', \") and brackets(\"(\", \")\", \"[[\", \"]]\").]\n"
            "[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]")

subst = r"\1AAAA"  # keep capture group 1 (everything up to the first [br]), then append AAAA
result = re.sub(pattern, subst, s, flags=re.MULTILINE)
print(result)

Output

[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
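
As a rough illustration of why the attempt in the question also removed "[br]" and "[*", the sketch below prints what each pattern actually matches. The shortened sample string s and the variable names are assumptions made for this example only.

```python
import re

# Shortened, hypothetical sample in the same shape as the strings in the question.
s = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* note]"""

# Pattern from the question: "[br]" and "[*" are part of the match itself,
# so re.sub replaces them together with the word.
consuming = re.search(r"\[br\](.)+?(\[\*)+", s)
print(consuming.group(0))

# Lookahead pattern: the text up to "[br]" is captured so it can be put back,
# and "[*" is only checked by the lookahead, never consumed.
pattern = r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)"
peeking = re.search(pattern, s)
print(peeking.group(0))

result = re.sub(pattern, r"\1AAAA", s)
print(result)
```

Everything inside the match is replaced, which is why the lookahead version puts group 1 back through \1 and leaves the annotation outside the match.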
