How to transform ordinary quotation marks to Guillemets (French quotes), except inside tags
Suppose we have the following text:
<a href="link">some link</a> How to transform "ordinary quotes" to «Guillemets»
It needs to be transformed into
<a href="link">some link</a> How to transform «ordinary quotes» to «Guillemets»
using a regular expression and Python.
I tried
import re
content = '<a href="link">some link</a> How to transform "ordinary quotes" to «Guillemets»'
# raw strings keep \g<1> from being read as a string escape
res = re.sub(r'(?:"([^>]*)")(?!>)', r'«\g<1>»', content)
print(res)
However, as @Wiktor Stribiżew noticed, this does not work if one or more tags have several attributes, so
<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»
is transformed into
<a href=«link" target=»_blank">some link</a> How to transform «ordinary quotes» to «Guillemets»
Update
Note that the text
- may be HTML, i.e.:
<div><a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»</div>
- may be non-HTML, i.e.:
How to transform "ordinary quotes" to «Guillemets»
- may be non-HTML but include some HTML tags, i.e.:
<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»
This worked for me:
res = re.sub(r'(?:"([^>]*)")(?!>)', r'«\g<1>»', content)
From the docs:
In addition to character escapes and backreferences as described
above, \g<name> will use the substring matched by the group named name, as
defined by the (?P<name>...) syntax. \g<number> uses the corresponding group
number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a
replacement such as \g<2>0. \20 would be interpreted as a reference to
group 20, not a reference to group 2 followed by the literal character
'0'. The backreference \g<0> substitutes in the entire substring
matched by the RE.
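To make the ambiguity described in the docs concrete, here is a small sketch. The two-group pattern and the input "ab" are made up purely for illustration.

```python
import re

text = "ab"
pattern = r"(a)(b)"  # illustrative pattern with exactly two groups

# \g<2>0 means "group 2, followed by a literal 0".
result = re.sub(pattern, r"\g<2>0", text)
print(result)

# \20 is read as a reference to group 20, which this pattern does not
# have, so re.sub raises re.error instead of inserting group 2 and '0'.
try:
    re.sub(pattern, r"\20", text)
except re.error as exc:
    print("error:", exc)
```

With this input the first call should give b0, while the second one fails with an invalid group reference error.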
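Would you be willing to do it in three steps: [a] swap out the quotes inside the HTML tags; [b] swap the remaining quotes for guillemets; [c] restore the quotes inside the HTML tags?
Before complaining about the speed of this, keep in mind that lookaheads are expensive.
# '\x00' (NUL) is an assumed placeholder; any character that cannot occur in the text works
first = re.sub(r'<.*?>', lambda x: re.sub(r'"', '\x00', x.group(0)), content)  # [a]
second = re.sub(r'"(.*?)"', r'«\1»', first)  # [b]
third = re.sub(r'\x00', '"', second)  # [c]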
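Re Louis's comment:
first = re.sub(r'<.*?>', lambda x: re.sub(r'"', 'WILLSWAPSOON', x.group(0)), content)
In some situations the strategy above will work. Perhaps the OP is working in one of them. Otherwise, if all of this is too much fuss, the OP can move over to BeautifulSoup and start playing with it...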
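The three-pass strategy above could be wrapped up roughly as follows. This is only a sketch: the function name guillemets_three_pass and the PLACEHOLDER constant are hypothetical, and the NUL character is an assumed placeholder. The function refuses input that already contains it, which covers the collision case.

```python
import re

PLACEHOLDER = "\x00"  # assumption: NUL does not normally occur in the text


def guillemets_three_pass(content):
    """Hypothetical helper: hide tag quotes, convert the rest, restore."""
    if PLACEHOLDER in content:
        raise ValueError("placeholder already present in the input")
    # [a] hide every double quote that sits inside <...>
    hidden = re.sub(
        r'<.*?>',
        lambda m: m.group(0).replace('"', PLACEHOLDER),
        content,
    )
    # [b] convert the remaining quote pairs to guillemets
    converted = re.sub(r'"(.*?)"', r'«\1»', hidden)
    # [c] put the original quotes back into the tags
    return converted.replace(PLACEHOLDER, '"')


sample = ('<a href="link" target="_blank">some link</a> '
          'How to transform "ordinary quotes" to «Guillemets»')
print(guillemets_three_pass(sample))
```

Like the original snippets, this treats anything between < and > on one line as a tag, so a > inside an attribute value or a tag spanning several lines would confuse it.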
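When you are holding a hammer, everything looks like a nail. You don't have to use a regular expression. A simple state machine will do (assuming anything inside <> is an HTML tag).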
# pos - current position in a string
# q1,q2 - opening and closing quotes position
s = ' How to transform "ordinary quotes" to «Guillemets» and " more <div><a href="link" target="_blank">some "bad" link</a>'
sl = list(s)
q1, q2 = 0, 0
pos = 0
while True:
    tag_open = s.find('<', pos)
    q1 = s.find('"', pos)
    if q1 < 0:
        break   # no more quotation marks
    elif tag_open >= 0 and q1 > tag_open:
        pos = s.find('>', tag_open)     # skip to the tag close
    elif (tag_open >= 0 and q1 < tag_open) or tag_open < 0:
        q2 = s.find('"', q1 + 1)
        if q2 > 0 and (tag_open < 0 or q2 < tag_open):
            # both quotes lie before the next tag: replace the pair
            sl[q1] = '«'
            sl[q2] = '»'
            s = ''.join(sl)
            pos = q2
        else:
            pos = q1 + 1    # unmatched quote: leave it and move on
print(s)
Explanation:
Scan your string,
    If not inside a tag,
        find the first and second quotation marks,
        replace accordingly,
        continue scanning from the second quotation mark
    Else
        continue to the end of the tag
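The same scan can also be sketched more compactly by splitting the string on tags first and converting quote pairs only in the text between them. This is an alternative formulation, not the answer's own code. The function name guillemets_outside_tags is hypothetical, and, as above, anything inside <> is assumed to be a tag.

```python
import re


def guillemets_outside_tags(text):
    """Hypothetical helper: convert quote pairs only in text between tags."""
    # The capturing group makes re.split keep the tags in the result, so
    # even indexes hold plain text and odd indexes hold the <...> tags.
    parts = re.split(r'(<[^>]*>)', text)
    for i in range(0, len(parts), 2):
        # A lone quote with no partner in the same text run is left as-is.
        parts[i] = re.sub(r'"([^"]*)"', r'«\1»', parts[i])
    return ''.join(parts)


s = (' How to transform "ordinary quotes" to «Guillemets» and " more '
     '<div><a href="link" target="_blank">some "bad" link</a>')
print(guillemets_outside_tags(s))
```

On the sample string this should leave the attribute quotes and the unmatched quote untouched, while "ordinary quotes" and "bad" become «ordinary quotes» and «bad».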