Python re.sub only matches the first occurrence

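I am trying to escape the double quotes inside a string to prepare it for loading with json.loads. The code below is my attempt to work out the right approach.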

import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

# Re-insert both captured groups around the escaped quote.
print(re.sub(r'("caption":".*?)"(.*?",")', r'\1\\"\2', one))
print(re.sub(r'("caption":".*?)"(.*?",")', r'\1\\"\2', two))

This is the current output:

"caption":"This caption should not match nor have any double quotes escaped","
"caption":"This caption \"should have "the duobles quotes" in the caption escaped"","

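The problem is that only the first double quote in the second string gets escaped. I realize there is a mistake in my regex; regular expressions are not my strong point. I have read a lot of threads here and spent a lot of time on Google, to no avail.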

Note that the actual string I am working with is about 10,000 characters long and contains many occurrences of both kinds of caption string.

>>> import re
>>> one = '"caption":"This caption should not match nor have any double quotes escaped","'
>>> two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
>>> match = re.match(r"(\"caption\"\:\")(.*)(\",\")", two)
>>> midstr = match.group(2).replace('"', u'\u005C"')
>>> newstr = "".join([match.group(1), midstr, match.group(3)])
>>> print(newstr)
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"","
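
import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

expression = r"""
(             # Capturing group 1
[\w ]         # The quote should be preceded by a word char or space.
)             # End group

(")           # Capturing group 2: match a quote character.

(             # Capturing group 3
[^,:]         # Quote should not be followed by a comma or colon.
)             # End group
"""
pattern = re.compile(expression, re.VERBOSE)

# Groups 1 and 3 are consumed by the match, so put them back around the escaped quote.
print(pattern.sub(r'\1\\"\3', one))
print(pattern.sub(r'\1\\"\3', two))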

Demo, updated with a bug fix.

I would try re.sub as follows:

import re

one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
# The replacement r'\\"' produces a backslash followed by a double quote.
result = re.sub(r"""(?<!^)(?<!:)(")(?!$)(?!:)""", r'\\"', two)
print(result)

Output:

"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"\","

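Regex explanation

Match every quote that is not at the start or end of the line and is not immediately preceded or followed by a colon, then replace each one with a backslash-prefixed quote (i.e. \").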

If you have the regex package installed (as mentioned in the comments), this should work:

result = regex.sub(r'(?<="caption":".*)"(?=.*",")', r'\"', subject)

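As you can see, the regex is the same as yours except that I changed your capturing groups to lookarounds. Since those parts of the string are no longer consumed, there is no need to reinsert them into the new string, so the replacement is just \".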
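
For context on why the third-party regex package is needed here: as far as I know, the standard library re module only accepts fixed-width lookbehinds, so a lookbehind containing .* is rejected at compile time. A small sketch to check this on your own interpreter (the variable names are only illustrative):

```python
import re

# The lookbehind contains .*, so its width is not fixed.
variable_lookbehind = r'(?<="caption":".*)"(?=.*",")'

try:
    re.compile(variable_lookbehind)
    stdlib_accepts = True
except re.error as exc:
    # The stdlib compiler refuses the pattern; keep the message for inspection.
    stdlib_accepts = False
    reason = str(exc)

print(stdlib_accepts)
```

If this prints False, the pattern has to be run through the regex package rather than re.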

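I can't speak to the efficiency of this regex because I know nothing about the surrounding text. If the target strings are each on their own line, it should be fine as long as you don't specify DOTALL mode. But the safest approach is to extract the strings first and process them individually.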
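
One possible reading of "extract the strings first and process them individually" is to hand re.sub a replacement function, so that each caption is matched once and its inner quotes are escaped with plain str.replace. This is only a sketch under assumptions: a caption value is taken to end at the first quote that is immediately followed by ," and the input is assumed to contain no quotes that are already escaped. The names CAPTION, escape_caption and escape_captions are hypothetical, not from the posts above.

```python
import json
import re

# Assumption: a caption value starts right after "caption":" and ends at the
# first quote that is immediately followed by ," (the start of the next key).
CAPTION = re.compile(r'("caption":")(.*?)"(?=,")')


def escape_caption(match):
    """Escape every double quote inside one caption value."""
    prefix, body = match.groups()
    return prefix + body.replace('"', '\\"') + '"'


def escape_captions(text):
    """Apply escape_caption to every caption found in text."""
    return CAPTION.sub(escape_caption, text)


two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
fixed = escape_captions(two)
print(fixed)

# Wrap the fragment so that it forms a complete JSON object, then parse it.
document = '{' + fixed + 'id":1}'
print(json.loads(document)["caption"])
```

Because the replacement is a function, its return value is inserted literally, so there is no need to worry about how backslashes are interpreted in a replacement template. A caption whose text itself contains "," would be cut short by this pattern, so it is worth checking against real data.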

import re

# The fourth parameter is count, the maximum number of replacements (not a position);
# the following removes only the first occurrence of "so".
sent = 'we are having so so much of fun'
re.sub("so", '', sent, count=1)
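
To make the effect of count concrete, here is a short sketch comparing count=1 with the default (count=0, which replaces every non-overlapping match). It also uses re.subn, which returns the number of replacements made. The variable names are only illustrative.

```python
import re

sent = 'we are having so so much of fun'

# count=1: only the leftmost match is replaced.
first_only = re.sub('so', '', sent, count=1)

# Default count=0: every non-overlapping match is replaced.
all_removed = re.sub('so', '', sent)

# re.subn returns a (new_string, number_of_replacements) tuple.
_, replacements = re.subn('so', '', sent)

print(repr(first_only))
print(repr(all_removed))
print(replacements)
```

So re.sub replaces all occurrences unless count limits it. When only the first quote gets escaped, as in the question, the cause is the pattern consuming the rest of the caption in a single match, not re.sub stopping early.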