Efficient regex for reducing only fully duplicate phrases separated by a specific delimiter in Python
Suppose I have a shopping list like this:
lines="""
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""
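A product on the shopping list should be reduced to a single copy only when the two halves are exact duplicates, i.e. ''[[excellent wheat|excellent wheat]]'' -> ''[[excellent wheat]]''. Partial duplicates should be left as they are.
I have looked around but could not find an ideal solution.
I would like to evaluate part of a multi-line string line by line, like this: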
for i in range(0, 100):
    lines[i] = regexHere(lines[i])  # regex expression here
    print(lines[i])
I would like to get the following output:
''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
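Thanks.
Edit: this works for the given example. What if the shopping list sits inside a listing that also contains random lines in other formats?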
lines="""
==///Listings==/?
assadsadsadsa
adasdsad
</test>
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
Remember to purchase this if on offer
''[[jub|jub/ha]]'',
''[[barley|barley/hops]]''
zcxcxzcxz
"""
You really don't need a regex for this; you can just use string operations:
lines="""
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""
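for line in lines.strip().split("\n"):
    # Split each ''[[first|second]]'' entry at the pipe
    first, second = line.split('|')
    # Drop the leading ''[[ and the trailing ]]'' before comparing
    if first[4:] == second[:-4]:
        print("''[[{}]]''".format(first[4:]))
    else:
        print(line)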
"""
Output:
''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""
You can do it like this:
lines="""
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
"""
>>> import re
>>> print(re.sub(r'(?<=\[)([^[]*)(?=\|)\|\1(?=\])', r'\1', lines))
''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
''[[spicy chips|spicy fries/chips]]''
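As a hedged variation on the same backreference idea, the pattern can be anchored on the literal [[ and ]] so that it also handles the mixed listing and more than one link per line. The names DUPLICATE_LINK and collapse_all are hypothetical, and the sketch assumes link text never contains '[', ']' or '|':

```python
import re

# Group 1 is the text before the pipe; \1 requires the identical text after it.
DUPLICATE_LINK = re.compile(r"\[\[([^\[\]|]+)\|\1\]\]")


def collapse_all(text):
    """Rewrite every [[X|X]] in text as [[X]]; leave everything else alone."""
    return DUPLICATE_LINK.sub(r"[[\1]]", text)


listing = """==///Listings==/?
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
''[[jub|jub/ha]]'',
zcxcxzcxz"""

collapsed = collapse_all(listing)
print(collapsed)
```

Because the closing ]] is part of the match, a partial duplicate such as [[jub|jub/ha]] cannot match: after the repeated "jub" the pattern needs "]]" but finds "/ha".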
If you want more efficiency, you can combine a simpler regex (no backtracking) with some Python string processing. Honestly, I don't know whether this is faster or not:
lines="""
==///Listings==/?
assadsadsadsa
adasdsad
</test>
''[[excellent wheat|excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
Remember to purchase this if on offer
''[[jub|jub/ha]]'',
''[[barley|barley/hops]]''
zcxcxzcxz
""".splitlines()
import re

# Python 3.8+ because of the walrus operator. Break the assignment and the test into two lines if you can't use it
for i, line in enumerate(lines):
    if m := re.search(r'(?<=\[\[)([^\]\[]*)(?=\]\])', line):
        x = m.group(1).partition('|')
        if x[0] == x[2]:
            span = m.span()
            lines[i] = line[0:span[0]] + x[0] + line[span[1]:]
print('\n'.join(lines))
Prints:
==///Listings==/?
assadsadsadsa
adasdsad
</test>
''[[excellent wheat]]''
''[[brilliant corn|Tom's brilliant corn]]''
</separation>
Remember to purchase this if on offer
''[[jub|jub/ha]]'',
''[[barley|barley/hops]]''
zcxcxzcxz
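Whether the partition approach is actually faster is left open above. One way to check on your own data is a small timeit harness like the sketch below. The sample text, the repetition counts and the function names are assumptions made for illustration, and timings will vary by machine and input, so no numbers are claimed here:

```python
import re
import timeit

# Synthetic input: a few representative lines repeated many times (an assumption).
SAMPLE = "\n".join([
    "</test>",
    "''[[excellent wheat|excellent wheat]]''",
    "''[[brilliant corn|Tom's brilliant corn]]''",
    "''[[jub|jub/ha]]'',",
    "zcxcxzcxz",
] * 200)

BACKREF = re.compile(r"(?<=\[)([^[]*)(?=\|)\|\1(?=\])")
INNER = re.compile(r"(?<=\[\[)([^\]\[]*)(?=\]\])")


def with_backreference(text):
    """Single re.sub over the whole text using a backreference."""
    return BACKREF.sub(r"\1", text)


def with_partition(text):
    """Simpler regex per line, then compare the halves with str.partition."""
    out = []
    for line in text.splitlines():
        m = INNER.search(line)
        if m:
            first, _, second = m.group(1).partition("|")
            if first == second:
                start, end = m.span()
                line = line[:start] + first + line[end:]
        out.append(line)
    return "\n".join(out)


# Sanity check before timing: both approaches should agree on this sample.
same = with_backreference(SAMPLE) == with_partition(SAMPLE)
print("same result:", same)

for fn in (with_backreference, with_partition):
    seconds = timeit.timeit(lambda: fn(SAMPLE), number=20)
    print(f"{fn.__name__}: {seconds:.3f}s")
```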