Python re.sub only matches the first occurrence
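I am trying to escape the double quotes inside a string so that it can be loaded with json.loads. The code below is my attempt to work out the right approach.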
import re
one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
print(re.sub('("caption":".*?)"(.*?",")', r'\1\\"\2', one))
print(re.sub('("caption":".*?)"(.*?",")', r'\1\\"\2', two))
This is the current output:
"caption":"This caption should not match nor have any double quotes escaped","
"caption":"This caption \"should have "the duobles quotes" in the caption escaped"","
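The problem is that only the first double quote in the second string gets escaped. I realize there is a mistake in my regex; regular expressions are not my strong point. I have read a lot of threads here and spent a lot of time on Google, to no avail.
Note that the actual string I am working with is about 10,000 characters long and contains many occurrences of both kinds of caption string.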
>>> import re
>>> one = '"caption":"This caption should not match nor have any double quotes escaped","'
>>> two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
>>> match = re.match(r"(\"caption\"\:\")(.*)(\",\")", two)
>>> midstr = match.group(2).replace('"', u'\u005C"')
>>> newstr = "".join([match.group(1), midstr, match.group(3)])
>>> print(newstr)
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"","
import re
expression = r"""
( # Capturing group 1
[\w ] # The quote should be preceded by a word character or a space.
) # End group
(") # Capturing group 2: match a quote character.
( # Capturing group 3
[^,:] # The quote should not be followed by a comma or colon.
) # End group
"""
pattern = re.compile(expression, re.VERBOSE)
# Put the neighbouring characters (groups 1 and 3) back around the escaped quote.
result = pattern.sub(r'\1\\"\3', one)
print(result)
Updated with a bug fix.
I would try re.sub as follows:
import re
one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'
result = re.sub(r"""(?<!^)(?<!:)(")(?!$)(?!:)""", r'\\\1', two)
print(result)
Output:
"caption":"This caption \"should have \"the duobles quotes\" in the caption escaped\"\","
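Regex explanation
This grabs every quote that is not at the start or end of the line and is not immediately preceded or followed by a :, then replaces it with the same quote prefixed by a backslash (i.e. \"). Note that for two this also escapes the second of the two adjacent quotes before the comma, as the output above shows, so the result may still need adjusting before json.loads accepts it.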
If you have the regex package installed (as mentioned in the comments), this should work:
result = regex.sub(r'(?<="caption":".*)"(?=.*",")', r'\"', subject)
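As you can see, the regex is the same as yours, except that I changed your capturing groups to lookarounds. Since those parts of the string are no longer consumed, there is no need to re-insert them into the new string, so the replacement is simply \".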
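To illustrate why capturing groups give only one replacement while lookarounds do not, here is a minimal sketch using only the standard re module. It counts the matches of the question's original pattern on the string two. The names pattern and matches are introduced here purely for illustration.

```python
import re

two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

# The question's pattern: both groups are part of the match, so they consume text.
pattern = re.compile(r'("caption":".*?)"(.*?",")')

matches = list(pattern.finditer(two))

# re.sub replaces non-overlapping matches, so the number of matches is the
# number of quotes that can be escaped by this pattern.
print(len(matches))
# The single match covers the whole string, leaving nothing for a second match.
print(matches[0].span() == (0, len(two)))
```

Because the one match swallows the caption from "caption":" through ",", re.sub has nothing left to scan, and only the first inner quote is replaced.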
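I can't speak to the efficiency of this regex, because I know nothing about the surrounding text. If the target strings are each on their own line, it should be fine as long as you don't specify DOTALL mode. But the safest approach is to extract the strings first and process them separately.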
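The "extract first, then process separately" idea could be sketched with the standard re module by passing a function as the replacement to re.sub. This is only a sketch under assumptions: every caption is assumed to end at the first "," that follows it, and captions are assumed to contain no quotes that are already escaped. The names CAPTION, _escape and escape_captions are hypothetical helpers, not part of any library.

```python
import json
import re

# Assumption: a caption runs from "caption":" up to the first "," after it.
CAPTION = re.compile(r'("caption":")(.*?)(",")')


def _escape(match):
    """Escape the double quotes in the caption body only."""
    prefix, body, suffix = match.groups()
    return prefix + body.replace('"', '\\"') + suffix


def escape_captions(text):
    """Apply _escape to every caption found in text."""
    return CAPTION.sub(_escape, text)


one = '"caption":"This caption should not match nor have any double quotes escaped","'
two = '"caption":"This caption "should have "the duobles quotes" in the caption escaped"","'

fixed = escape_captions(two)
print(fixed)

# Drop the trailing ," fragment and wrap in braces to get a complete JSON object.
doc = json.loads('{' + fixed[:-2] + '}')
print(doc['caption'])
```

Since each caption is one match and the callback handles all of its inner quotes, this also works when several captions appear in one long string, for example escape_captions(one + two).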
import re
# The fourth parameter is count, the maximum number of replacements (not a position);
# the following removes only the first occurrence of "so".
sent = 'we are having so so much of fun'
re.sub("so", '', sent, count=1)
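For comparison, a small sketch of how the count argument of re.sub changes the result. With the default count=0 every occurrence is replaced, which is why re.sub does not normally stop after the first match. The variable names first_only and all_of_them are illustrative.

```python
import re

sent = 'we are having so so much of fun'

# count=1: stop after the first replacement.
first_only = re.sub('so', '', sent, count=1)

# Default count=0: replace every non-overlapping occurrence.
all_of_them = re.sub('so', '', sent)

print(repr(first_only))
print(repr(all_of_them))
```

Passing count by keyword keeps it from being confused with the flags argument, which comes right after it in the signature.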