Regex dealing with brackets
I have multiple strings, such as
string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string2 = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]"""
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""
strings = [string1, string2, string3]
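Every string contains one or more "[br]"s.
Each string may or may not include annotations.
Every annotation starts with "[*" and ends with "]". It may include double brackets ("[[" and "]]") but never single ones ("[" and "]"), so there won't be any confusion (e.g. [* some annotation with [[brackets]]]).
The words I want to replace are those between the first "[br]" and the annotations (if there are any; otherwise, the end of the string), which are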
word1 = """팔짱낄 공''':'''"""
word2 = """낟알 과'''-'''"""
word3 = """둘레 곽[br]클 확"""
So I tried
import re
for string in strings:
    print(re.sub(r"\[br\](.)+?(\[\*)+", "AAAA", string))
expecting
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
The logic of the regex is
\[br\]
: the first "[br]"
(.)+?
: one or more characters that I want to replace, lazy
(\[\*)+
: one or more "[*"s
But the result was
[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[顆|{{{#!html}}}]]AAAA some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]]AAAA another annotation.][* another annotation.]
instead. I also tried r"\[br\](.)+?(\[\*)*"
but it didn't work either. How can I solve this?
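One way to see what goes wrong is to print what the attempted pattern actually matches, since re.sub replaces the whole match and not just the lazy group. The sketch below is only illustrative: the names attempt, with_note and without_note are made up here, and the annotation text is shortened.

```python
import re

# The pattern from the attempt above.
attempt = re.compile(r"\[br\](.)+?(\[\*)+")

# Hypothetical shortened samples: one with an annotation, one without.
with_note = "[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* note.]"
without_note = "[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"

m = attempt.search(with_note)
# The full match includes "[br]" and "[*", so re.sub removes both of them.
print(m.group(0))

# With no "[*" in the string the pattern cannot match, so nothing is replaced.
print(attempt.search(without_note))
```

This suggests two things a fix has to do: keep "[br]" (and "[*") out of the replaced text, for example with a capture group or a lookahead, and allow the end of the string as an alternative stopping point.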
The best I could come up with is to first check whether there are any annotations:
import re
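r = re.compile(r'''
    (\[br])        # group 1: the first [br]
    (.*?)          # group 2: the word to replace (lazy)
    (\[\*.*\]$)    # group 3: all annotations, up to the end of the string
    ''', re.VERBOSE)
annotation = re.compile(r'''
    (\[\*.*]$)     # at least one annotation at the end of the string
    ''', re.VERBOSE)
def replace(m):
    return m.group(1) + "AAAA" + m.group(3)
for s in string1, string2, string3:
    print()
    print(s)
    if annotation.search(s):
        print(r.sub(replace, s))
    else:
        print(re.sub(r'\[br](.*)', '[br]AAAA', s))
which gives the expected output: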
[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
I suppose you could move the if into the replace function, but I'm not sure that would be much of an improvement. It would look something like:
import re
r = re.compile(r'''
    ^(?P<prefix>.*)
    (?P<br>\[br].*?)
    (?P<annotation>\[\*.*\])?
    (?P<rest>[^\[]*)$
    ''', re.VERBOSE)
def replace(m):
    g = m.groupdict()
    # the greedy prefix will contain all but the last [br], thus the split...
    head = g['prefix'].split('[br]')[0]
    if g['annotation'] is None:
        # without an annotation, "rest" holds the word being replaced, so drop it
        return head + "[br]AAAA"
    return head + "[br]AAAA" + g['annotation'] + g['rest']
for s in string1, string2, string3:
    print()
    print(s)
    print(r.sub(replace, s))
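You can use
^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)
The pattern matches
^
Start of string
(.*?\[br])
Capture group 1, match as few characters as possible up to the first occurrence of [br]
.+?
Match any character 1+ times, as few as possible
(?=
Positive lookahead, assert that what is to the right is
\[\*.*?](?<!].)(?!])
Match [* and continue to a ] that is not preceded or followed by another ]
|
Or
$
Assert the end of the string
)
Close the lookahead
Replace with capture group 1 followed by AAAA, like \1AAAA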
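The least obvious part of the pattern is probably the `](?<!].)(?!])` fragment. As a small sketch of how it behaves in isolation (the names single_close and note, and the sample annotation text, are assumptions made up for this illustration):

```python
import re

# A "]" whose previous character is not "]" and whose next character is not "]".
# (?<!].) looks back over two characters: the one before the matched "]" must not be "]".
# (?!]) looks ahead: the character after the matched "]" must not be "]".
single_close = re.compile(r"](?<!].)(?!])")

note = "[* see [[link]] here.]"
m = single_close.search(note)

# Both brackets of "]]" are skipped; only the final, single "]" qualifies.
print(m.start(), len(note) - 1)
```

Because annotations never contain a single "]", the first bracket that passes both checks should be the one that closes the annotation.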
Example code
import re
pattern = r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)"
s = ("[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''\n"
"[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', \") and brackets(\"(\", \")\", \"[[\", \"]]\").]\n"
"[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]")
subst = r"\1AAAA"
result = re.sub(pattern, subst, s, flags=re.MULTILINE)
print(result)
Output
[[拱|{{{#!html}}}]][br]AAAA
[[顆|{{{#!html}}}]][br]AAAA[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]
[[廓|{{{#!html}}}]][br]AAAA[* another annotation.][* another annotation.]
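If the strings are processed one at a time, as in the question, the same pattern could be wrapped in a small helper. This is only a sketch: replace_word and PATTERN are hypothetical names, and a function is used as the replacement so that backslashes in the replacement text are not interpreted.

```python
import re

# Same pattern as above, compiled once; without re.MULTILINE, ^ and $ apply to the whole string.
PATTERN = re.compile(r"^(.*?\[br]).+?(?=\[\*.*?](?<!].)(?!])|$)")

def replace_word(s, repl="AAAA"):
    """Replace the text between the first [br] and the annotations (or the end of the string)."""
    # Keep group 1 (everything up to and including the first [br]) and append repl.
    return PATTERN.sub(lambda m: m.group(1) + repl, s, count=1)

string1 = """[[拱|{{{#!html}}}]][br]팔짱낄 공''':'''"""
string2 = """[[顆|{{{#!html}}}]][br]낟알 과'''-'''[* some annotation that may include quote marks(', ") and brackets("(", ")", "[[", "]]").]"""
string3 = """[[廓|{{{#!html}}}]][br]둘레 곽[br]클 확[* another annotation.][* another annotation.]"""

for s in (string1, string2, string3):
    print(replace_word(s))
```

A string without any "[br]" does not match and is returned unchanged.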