Parsing nested HTML lists using Python

Question

My HTML contains nested lists like this:

<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>

I need to parse them so that they look like this:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana

I tried using BeautifulSoup, but I'm stuck on how to account for the nesting in my code.

Example, where x contains the HTML listed above:

import bs4

soup = bs4.BeautifulSoup(x, "html.parser")
for ul in soup.find_all("ul"):
    for li in ul.find_all("li"):
        li.replace_with("+ {}\n".format(li.text))

Answer 1

It's a bit of a hack, but you can use lxml instead:

import lxml.html as lh
from lxml import etree

uls = """[your html above]"""
doc = lh.fromstring(uls)
tree = etree.ElementTree(doc)
for e in doc.iter('li'):
    # The XPath of each <li> has one 'ul' step per nesting level
    path = tree.getpath(e)
    print('+' * path.count('ul'), e.text)

Output:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
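
The same ancestor-counting idea can also be sketched in BeautifulSoup, which the question started from. This is a minimal sketch rather than a definitive solution: the function name flatten_with_bs4 is made up for illustration, and it assumes the markup from the question, where nested <ul> elements are siblings of the <li> elements rather than children of them.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""


def flatten_with_bs4(html):
    """Hypothetical helper: prefix each <li> with one '+' per enclosing <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # find_all walks the tree in document order, so the items stay in sequence
    for li in soup.find_all("li"):
        depth = len(li.find_parents("ul"))  # number of <ul> ancestors
        lines.append("+" * depth + " " + li.get_text(strip=True))
    return "\n".join(lines)


print(flatten_with_bs4(html_doc))
```

Counting actual <ul> ancestors avoids relying on the substring 'ul' appearing in a path string. If a <ul> were nested inside an <li>, get_text would also pull in the nested items' text, so that case would need extra handling.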

Answer 2

I think it would be easier to convert the HTML string to Markdown with custom bullets. This can be done with markdownify:

import markdownify

formatted_html = markdownify.markdownify(x, bullets=['+', '++', '+++'], strip="ul")

Result:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana

Answer 3

You can use recursion:

import bs4
from bs4 import BeautifulSoup as soup
s = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""
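def indent(d, c=0):
    # Join this tag's own non-blank text nodes (only the <li> tags have any here)
    if (text := ''.join(i for i in d.contents if isinstance(i, bs4.NavigableString) and i.strip())):
        yield f'{"+"*c} {text}'
    # Recurse into child tags, one level deeper
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from indent(i, c+1)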

print('\n'.join(indent(soup(s, 'html.parser').ul)))

Output:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
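
For comparison, the depth tracking that the recursion performs implicitly can be sketched with only the standard library's html.parser, using a counter of open <ul> tags. This is an illustrative sketch, not part of any answer above: the names ListFlattener and flatten_with_stdlib are made up, and it assumes every <li> has an explicit closing tag, as in the question's markup.

```python
from html.parser import HTMLParser

html_doc = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""


class ListFlattener(HTMLParser):
    """Hypothetical parser: collects <li> text with one '+' per open <ul>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <ul> elements currently open
        self.in_item = False  # True between <li> and </li>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace between tags; keep only text inside an <li>
        if self.in_item and text:
            self.lines.append("+" * self.depth + " " + text)


def flatten_with_stdlib(html):
    parser = ListFlattener()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(flatten_with_stdlib(html_doc))
```

The trade-off is that this event-based version needs no third-party package but is less forgiving of unusual markup than BeautifulSoup or lxml.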
