Use BeautifulSoup to Find a Custom HTML Tag

Question

I am trying to use BeautifulSoup to locate Gliffy diagrams on an HTML page. The page source looks roughly like this:

<p>Lorem ipsum dolor sit amet</p>
<p>Figure: Consectetur adipiscing elit</p>
<p>
   <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
      <ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
      <ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
      <ac:parameter ac:name="pagePin">2</ac:parameter>
   </ac:structured-macro>
</p>
<p><br/></p>

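I want to locate <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1"> on the page, but I don't want to use a blanket call like soup.find_all('ac:structured-macro'), because Confluence uses many kinds of macros. I want to pinpoint only the macro with ac:name="gliffy" and exclude every other macro.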

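However, this doesn't look like a standard HTML tag, and I'm not sure BeautifulSoup is the right choice. Should I use another library such as lxml? Either way, please let me know which library and which function I should use, and how to call it, to pinpoint the Gliffy diagram in this HTML page. Thanks.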

Answer 1

For XML data you can still use BeautifulSoup, but you need to install the lxml parser, which is not part of the standard library.

pip install lxml

Here is an example of how to find the tag:

from bs4 import BeautifulSoup

html = """<p>Lorem ipsum dolor sit amet</p>
<p>Figure: Consectetur adipiscing elit</p>
<p>
    <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
    <ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
    <ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
    <ac:parameter ac:name="pagePin">2</ac:parameter>
    </ac:structured-macro>
</p>
<p><br/></p>"""


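# Parse with the lxml parser installed above
soup = BeautifulSoup(html, "lxml")

# Match any tag whose ac:name attribute is exactly "gliffy"
for tag in soup.find_all(attrs={"ac:name": "gliffy"}):
    print(tag)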
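
The attribute filter above matches any tag with ac:name="gliffy". If you also want to require the tag name and read the macro's parameters, something like the following should work. This is only a sketch: it assumes the built-in "html.parser" instead of lxml (so nothing extra needs installing), and the second "toc" macro in the sample is made up here purely to show that other macros are skipped.

```python
from bs4 import BeautifulSoup

# Sample markup: the Gliffy macro from the question plus a hypothetical
# "toc" macro, added only to demonstrate that other macros are excluded.
html = """<p>
   <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
      <ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
      <ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
      <ac:parameter ac:name="pagePin">2</ac:parameter>
   </ac:structured-macro>
</p>
<p>
   <ac:structured-macro ac:name="toc" ac:schema-version="1"></ac:structured-macro>
</p>"""

# Assumption: the standard-library parser is used here instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# Require both the tag name and the ac:name attribute value.
gliffy_macros = soup.find_all("ac:structured-macro", attrs={"ac:name": "gliffy"})

# Collect each macro's <ac:parameter> children into a dict keyed by ac:name.
all_params = [
    {p["ac:name"]: p.get_text(strip=True) for p in macro.find_all("ac:parameter")}
    for macro in gliffy_macros
]

for params in all_params:
    print(params)
```

Passing the tag name together with attrs keeps the match from picking up some unrelated element that happens to carry the same ac:name value.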

