BeautifulSoup replaceWith() method adding escaped html, want it unescaped

Question

I have a Python method (thanks to this snippet) that takes some HTML and uses BeautifulSoup and Django's urlize to turn URLs into links, skipping the ones that are already formatted:

from django.utils.html import urlize
from bs4 import BeautifulSoup

def html_urlize(self, text):
    soup = BeautifulSoup(text, "html.parser")

    print(soup)

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        if textNode.parent and getattr(textNode.parent, 'name') == 'a':
            continue  # skip already formatted links
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)

    print(soup)

    return str(soup)

Sample input text (as output by the first print statement) looks like this:

this is a formatted link <a href="http://google.ca">http://google.ca</a>, this one is unformatted and should become formatted: http://google.ca

The resulting return text (as output by the second print statement) looks like this:

this is a formatted link <a href="http://google.ca">http://google.ca</a>, this one is unformatted and should become formatted: &lt;a href="http://google.ca"&gt;http://google.ca&lt;/a&gt;

As you can see, it is formatting the link, but it does so with escaped HTML, so when I print it in a template with {{ my.html|safe }} it does not render as HTML.

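So how can I get the tags added by urlize to be unescaped and render properly as HTML? I suspect this has something to do with me using it as a method instead of a template filter. I can't actually find any documentation on this method; it doesn't appear under django.utils.html.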

Edit: It looks like the escaping actually happens on this line: textNode.replaceWith(urlizedText).

Answer 1

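The escaping seems to happen because you are using BeautifulSoup to replace a text node with another text node. replaceWith() treats the plain string you pass as text, so the < and > in the urlize output are written out as HTML entities.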
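The difference can be reproduced without Django. The following is a minimal sketch, assuming a hard-coded string stands in for the urlize output; `link_markup` and `replace_first_text` are illustrative names that do not appear in the question.

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for what urlize() returns for a bare URL (assumption).
link_markup = '<a href="http://google.ca">http://google.ca</a>'


def replace_first_text(html, replacement):
    """Replace the first text node in `html` and return the serialized document."""
    soup = BeautifulSoup(html, "html.parser")
    text_node = soup.find(string=True)
    text_node.replace_with(replacement)
    return str(soup)


# A plain str is inserted as a text node, so its markup is escaped on output.
escaped = replace_first_text("<p>http://google.ca</p>", link_markup)

# A parsed BeautifulSoup object is inserted as real tags instead.
parsed = replace_first_text(
    "<p>http://google.ca</p>", BeautifulSoup(link_markup, "html.parser")
)

print(escaped)
print(parsed)
```

The first print should show the `&lt;a ...&gt;` entities from the question, while the second should show a real `<a>` tag inside the `<p>`.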

One way to get what you want is to build up a new string from the output of urlize (which does not seem to care whether a link is already formatted).

from django.utils.html import urlize
from bs4 import BeautifulSoup

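def html_urlize(self, text):
    soup = BeautifulSoup(text, "html.parser")

    finalFragments = []
    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        if getattr(textNode.parent, 'name') == 'a':
            # already a link: keep the whole <a> tag as-is
            finalFragments.append(str(textNode.parent))
        else:
            # plain text: let urlize wrap any URLs in <a> tags
            finalFragments.append(urlize(textNode))

    # note: only text nodes and <a> tags are kept; other tags are dropped
    return "".join(finalFragments)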

However, if you only need to render this in a template, you can simply apply urlize to the input string as a template filter:

{{input_string|urlize}}

Answer 2

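You can parse your urlizedText string into a new BeautifulSoup object, which is then inserted as actual tags rather than as text inside a tag (text is escaped, as you would expect):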

from django.utils.html import urlize
from bs4 import BeautifulSoup

def html_urlize(self, text):
    soup = BeautifulSoup(text, "html.parser")

    print(soup)

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        if textNode.parent and getattr(textNode.parent, 'name') == 'a':
            continue  # skip already formatted links
        urlizedText = urlize(textNode)
        textNode.replaceWith(BeautifulSoup(urlizedText, "html.parser"))

    print(soup)

    return str(soup)
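To try this approach outside a Django project, here is a rough, self-contained sketch. It assumes a small regex-based `simple_linkify` as a hypothetical stand-in for `django.utils.html.urlize`; it is far less thorough than the real function and exists only to make the example runnable. `html_linkify` is likewise an illustrative name.

```python
import html
import re

from bs4 import BeautifulSoup

# Very rough URL pattern, for illustration only (assumption, not urlize's rules).
URL_RE = re.compile(r"https?://[^\s<]+")


def simple_linkify(text):
    """Hypothetical stand-in for urlize: wrap bare http(s) URLs in <a> tags."""
    return URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)


def html_linkify(markup):
    soup = BeautifulSoup(markup, "html.parser")
    for text_node in soup.find_all(string=True):
        if text_node.parent is not None and text_node.parent.name == "a":
            continue  # skip already formatted links
        # Escape the raw text first so stray "<" or "&" survive the re-parse.
        escaped = html.escape(str(text_node), quote=False)
        linked = simple_linkify(escaped)
        if linked == escaped:
            continue  # nothing to linkify in this node
        # Re-parse the markup so it is inserted as tags, not as escaped text.
        text_node.replace_with(BeautifulSoup(linked, "html.parser"))
    return str(soup)


sample = (
    'this is a formatted link <a href="http://google.ca">http://google.ca</a>, '
    "this one is unformatted: http://google.ca"
)
result = html_linkify(sample)
print(result)
```

Escaping the text before linkifying is a design choice of this sketch: a text node's string value is already unescaped, so re-parsing it directly could turn a literal `<` into a tag.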

python

django

beautifulsoup