How to change specific link tags to text using the re module?

Question

I have some HTML text. For example:

<a href="https://google.com">Google</a> Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua.<br />
<br />
#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>

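I want to change the links that are preceded by "#" into plain text (for example, wrapped in <b></b> tags). All other links should stay unchanged.

I tried using the re module, without much success: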

import re

cond = re.compile('#<.*?>')
output = re.sub(cond, "#", "#<a href=\"stuff1\">stuff1</a>")
print(output)

Output:

#stuff1</a>

The closing </a> is still there at the end.

Answer 1

You're close! Your pattern '#<.*?>' only matches the opening tag, so the link text and the closing </a> are left behind. Try this:

r'#<a href=".*?">(.*?)</a>'

This is also a bit more specific, since it only matches <a> tags. Also note that it's good practice to write regular expressions as raw string literals (the r at the beginning). The parentheses, (.*?), are a capturing group. From the docs:

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

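You can refer to this group in the replacement argument as \g<#>, where # is the number of the group you want. We defined only one group, so it is the first: \g<1>.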
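To make the group reference concrete, here is a small sketch using the sample string from the question. The variable names are only illustrative. It shows what group 1 captures, and that \g<1> and the shorter \1 form refer to the same group in a replacement string.

```python
import re

# Sample input borrowed from the question.
text = '#<a href="stuff1">stuff1</a>'
pattern = r'#<a href=".*?">(.*?)</a>'

# The single capturing group (.*?) holds the link text.
match = re.search(pattern, text)
link_text = match.group(1)

# \g<1> and \1 both refer to group 1 in the replacement string.
with_g = re.sub(pattern, r'#\g<1>', text)
with_backslash = re.sub(pattern, r'#\1', text)

print(link_text)       # stuff1
print(with_g)          # #stuff1
print(with_backslash)  # #stuff1
```

The \g<1> form is the safer habit: if the replacement has a digit right after the reference, \g<1>0 is unambiguous, whereas \10 would be read as group 10.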

Also, once you have compiled a regular expression, you can call its own sub method:

pattern = re.compile(r'my pattern')
pattern.sub(r'replacement', 'text')

The module-level re.sub function is normally used when you haven't compiled the pattern:

re.sub(r'my pattern', r'replacement', 'text')

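The performance difference is usually negligible, because the re module caches recently compiled patterns, so use whichever makes your code clearer. (Personally, I usually prefer compiling. As with any other variable, a compiled expression lets me use a clear, reusable name.)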

So your code would be:

import re

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')
output = pound_links.sub(r'#\g<1>', '#<a href="stuff1">stuff1</a>')

print(output)

Or:

import re

output = re.sub(r'#<a href=".*?">(.*?)</a>',
                r"#\g<1>",
                "#<a href=\"stuff1\">stuff1</a>")

print(output)

Either way, the output is:

#stuff1
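The question also mentions wrapping the text in <b></b> tags and leaving other links alone. As a rough sketch of that, the same pattern can be run over a shortened version of the question's sample HTML. The <b> replacement string is an assumption about the desired output, and the variable names are illustrative.

```python
import re

# Shortened version of the sample HTML from the question.
html = (
    '<a href="https://google.com">Google</a> Lorem ipsum<br />\n'
    '#<a href="#something">somethin</a> '
    '#<a href="#somethingelse">somethinelse</a>'
)

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')

# Only links directly preceded by "#" are rewritten; the Google link is kept.
result = pound_links.sub(r'<b>#\g<1></b>', html)

print(result)
```

The Google link survives because the pattern requires a literal "#" immediately before the opening <a tag.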

python

regex

python-re