Using Python and BeautifulSoup, select only text nodes that are NOT wrapped in <a>
I am trying to parse some text so that I can urlize (wrap in <a> tags) any links that are not already formatted. Here is some sample text:
text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'
This is what I have so far, taken from here:
from django.utils.html import urlize
from bs4 import BeautifulSoup
...
def urlize_html(text):
    soup = BeautifulSoup(text, "html.parser")
    # Collect every text node in the document, including those inside <a>
    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)
    return str(soup)
But this also catches the middle link in the example, so it ends up double-wrapped in <a> tags. The result looks like this:
<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a link where the test is the same as the link: <a href="https://djangosnippets.org/snippets/2072/" target="_blank"><a href="https://djangosnippets.org/snippets/2072/">https://djangosnippets.org/snippets/2072/</a></a>, and this is a link too but not formatted: <a href="https://djangosnippets.org/snippets/2072/">https://djangosnippets.org/snippets/2072/</a></p>
How can I change textNodes = soup.findAll(text=True) so that it only includes text nodes that are not already wrapped in an <a> tag?
Text nodes keep a reference to their parent, so you can simply test whether the parent is an a tag:
for textNode in textNodes:
    # A text node directly inside <a> is already a link, so leave it alone
    if textNode.parent and getattr(textNode.parent, 'name', None) == 'a':
        continue  # skip links
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)
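For reference, here is a minimal, self-contained sketch of the same idea that runs without Django. It rests on a few assumptions that are not in the original post: `simple_urlize` is a hypothetical, deliberately naive stand-in for `django.utils.html.urlize`; `find_parent("a")` is used instead of checking only the direct parent, so text nested deeper inside a link (for example `<a><b>...</b></a>`) is skipped too; and the urlized string is re-parsed into nodes before insertion so that the new `<a>` markup is not escaped as plain text.

```python
import html
import re

from bs4 import BeautifulSoup, NavigableString

# Naive URL pattern, for illustration only (assumption, not Django's logic).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def simple_urlize(text):
    """Hypothetical stand-in for django.utils.html.urlize."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        url = match.group(0)
        # Escape the plain text between URLs so it stays text after re-parsing.
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<a href="{0}">{0}</a>'.format(html.escape(url)))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def urlize_html(markup, urlize=simple_urlize):
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list, so the tree can be modified while looping.
    for text_node in soup.find_all(string=True):
        # Leave comments and other special string types untouched.
        if type(text_node) is not NavigableString:
            continue
        # Skip text that already sits somewhere inside an <a> tag.
        if text_node.find_parent("a") is not None:
            continue
        # Parse the urlized HTML so new <a> tags become real nodes.
        fragment = BeautifulSoup(urlize(str(text_node)), "html.parser")
        for child in list(fragment.contents):
            text_node.insert_before(child)
        text_node.extract()
    return str(soup)
```

Calling `urlize_html(text)` on the sample text from the question should leave the two existing links as they are and wrap only the trailing bare URL. To use Django's function instead of the stand-in, it can be passed in as the `urlize` argument.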