How to replace text between multiple tags based on character length

Question

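I am working with dirty text data (not valid HTML). I am doing natural language processing, and short code snippets should not be removed because they can contain valuable information, while long code snippets do not.

That is why I want to remove the text between code tags only if the content to be removed is longer than n characters.

Let's say the number of characters allowed between two code tags is n <= 5. Then everything between those tags that is longer than 5 characters would be removed.

My approach so far removes every code element, regardless of its length: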

import re

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub("<code>.*?</code>", '', text)
print(text)

Output: This is a string  another string  another string  another string.

Desired output:

"This is a string <code>1234</code> another string <code>123</code> another string another string."

Is there a way to measure the length of the text inside every <code>...</code> occurrence before it is actually removed?

Answer 1

In Python, BeautifulSoup is often used to manipulate HTML/XML content. If you use this library, you can use something like

from bs4 import BeautifulSoup
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
soup = BeautifulSoup(text,"html.parser")
for code in soup.find_all("code"):
    if len(code.encode_contents()) > 5: # Check the inner HTML length
        code.extract()                  # Remove the node found

print(str(soup))
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

Note that this checks the length of the inner HTML (encode_contents() returns it as bytes), not the length of the inner text.
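If the limit should apply to the visible text rather than the inner HTML, a minimal sketch along the same lines could measure get_text() instead. The helper name strip_long_code and the sample string with a nested <b> tag are made up for illustration; they are not part of the question.

```python
from bs4 import BeautifulSoup

def strip_long_code(text, n=5):
    """Remove <code> elements whose visible text is longer than n characters."""
    soup = BeautifulSoup(text, "html.parser")
    for code in soup.find_all("code"):
        # get_text() skips nested markup, so only the text characters are counted
        if len(code.get_text()) > n:
            code.extract()
    return str(soup)

# Hypothetical sample: the first snippet has 9 bytes of inner HTML but only 2 text characters
sample = "keep <code><b>12</b></code> drop <code>123456789</code> end"
result = strip_long_code(sample)
print(result)
# => keep <code><b>12</b></code> drop  end
```

With the inner HTML check, the first snippet would have been removed as well, because <b>12</b> is longer than 5 characters.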

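With regex, you can use a negated character class pattern, [^<], to match any character other than <, and apply a limiting quantifier to it. If everything longer than 5 characters should be removed, use the {6,} quantifier: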

import re
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub(r'<code>[^<]{6,}</code>', '', text)  # [^<] keeps the match inside a single pair of tags
print(text)
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

See this Python demo.
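To count the length explicitly before deciding, as the question asks, one option is to pass a function to re.sub as the replacement. This is only a sketch: the names remove_long_code and replace are illustrative, and it assumes the tags are never nested. Unlike the [^<] pattern, it also handles snippets that contain a < character.

```python
import re

def remove_long_code(text, n=5):
    """Drop <code>...</code> spans whose inner text is longer than n characters."""
    def replace(match):
        inner = match.group(1)
        # Return the whole match unchanged when it is short enough, otherwise drop it
        return match.group(0) if len(inner) <= n else ""
    # DOTALL lets a snippet span several lines
    return re.sub(r"<code>(.*?)</code>", replace, text, flags=re.DOTALL)

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
result = remove_long_code(text)
print(result)
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.
```

The threshold n is a parameter here, so the same function can be reused with a different limit, for example remove_long_code(text, n=3).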
