How to ignore HTML comment tags in a regex with Python

Question

I am replacing special characters with some ASCII codes, and I ignore HTML tags with the help of the regex below:

text_list = re.findall(r'>([\S\s]*?)<', html)

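It ignores all the HTML tags, which is what we want, but it does not ignore the HTML comment closing tag "-->".

Any help is appreciated. What should I change in the regex?

A screenshot is attached for your reference.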

Answer 1

You can match the comments and discard them using re.findall:

text_list = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', html)))
# Or, a bit more efficient:
text_list = list(filter(None, re.findall(r'<!--[^-]*(?:-(?!->)[^-]*)*-->|>([^<]*)<', html)))

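See this regex demo (and the second one).

The regex matches substrings between <!-- and -->, and it also matches substrings between > and <, capturing the text between the latter two delimiters into Group 1. When a pattern contains a capturing group, re.findall returns only the captured text, so a comment match contributes an empty string, which filter(None, ...) then removes.

See the Python demo: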
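To make the role of the capturing group concrete, here is a minimal sketch that prints the raw re.findall result before and after filtering. The `sample` string and the variable names are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample input: one comment followed by one element.
sample = "<!-- note --><p>Hi</p>"

# Same pattern as in the answer: the comment branch has no capturing group,
# so a comment match contributes an empty string for Group 1.
pattern = r'(?s)<!--.*?-->|>(.*?)<'

raw = re.findall(pattern, sample)      # Group 1 values, including empty ones
cleaned = list(filter(None, raw))      # drop the empty strings

print(raw)
print(cleaned)
```

The empty entry in `raw` comes from the comment branch, which is why the filter step is needed.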

import re
html = "<a href='link.html'>URL</a>Some text <!-- Comment --><p>Par here</p>More text"
text_list = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', html)))
print(text_list)
# => ['URL', 'Some text ', 'Par here']
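One thing to be aware of: the pattern above consumes the < that ends each text run, so when a comment directly follows a tag, the comment branch may not get a chance to match, and a comment containing > and < can leak text into the result. The sketch below is one possible variation, not part of the original answer, that uses a lookahead so the < is left in place. The function name `extract_text` and the sample string are hypothetical.

```python
import re

# Variation on the answer's pattern (an assumption, not the original code):
# (?=<) checks for the next "<" without consuming it, so a following
# "<!--" is still available to the comment branch.
TEXT_OR_COMMENT = re.compile(r'(?s)<!--.*?-->|>([^<]*)(?=<)')

def extract_text(html):
    """Return the non-empty text runs found between tags, skipping comments."""
    return [text for text in TEXT_OR_COMMENT.findall(html) if text]

# Hypothetical input: a comment that itself contains ">" and "<".
sample = "<p>Hi</p><!-- a > b < c --><b>x</b>"

original = list(filter(None, re.findall(r'(?s)<!--.*?-->|>(.*?)<', sample)))
print(original)              # the answer's pattern picks up ' b ' from the comment
print(extract_text(sample))  # the lookahead variant skips the whole comment
```

Like the original pattern, this only captures text that is followed by another tag, so trailing text after the last tag is not returned.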

Answer 2

Please also try passing an encoding parameter when reading the file.
