Parsing nested HTML lists using Python

My HTML contains nested lists like this:

<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>

I need to parse them so that they look like this:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana

I tried using BeautifulSoup, but I'm stuck on how to account for the nesting in my code.

Example, where x contains the HTML listed above:

import bs4

soup = bs4.BeautifulSoup(x, "html.parser")
for ul in soup.find_all("ul"):
    for li in ul.find_all("li"):
        li.replace_with("+ {}\n".format(li.text))

It's a bit of a hack, but you can do this with lxml instead:

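import lxml.html as lh
from lxml import etree  # needed for etree.ElementTree below

uls = """[your html above]"""
doc = lh.fromstring(uls)
tree = etree.ElementTree(doc)
for e in doc.iter('li'):
    # The XPath of each <li> contains one 'ul' step per nesting level
    path = tree.getpath(e)
    print('+' * path.count('ul'), e.text)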

Output:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
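
If counting the substring 'ul' in the path feels too fragile (any other tag name containing "ul" would inflate the count), a variant of the same idea is to count the `<ul>` ancestors of each `<li>` directly. This is only a sketch: the function name `bullet_lines` is made up for illustration, and it assumes each `<li>` holds plain text, as in the question's markup.

```python
import lxml.html as lh

html = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""

def bullet_lines(markup):
    """Return one '+'-prefixed line per <li>, one '+' per enclosing <ul>."""
    doc = lh.fromstring(markup)
    lines = []
    for li in doc.iter('li'):
        # Nesting depth = number of <ul> elements above this <li>
        depth = sum(1 for _ in li.iterancestors('ul'))
        lines.append('{} {}'.format('+' * depth, (li.text or '').strip()))
    return lines

print('\n'.join(bullet_lines(html)))
```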

I think it would be easier to convert the HTML string to Markdown with custom bullets. This can be done with markdownify:

import markdownify

# strip takes a list of tag names; a bare string would be matched by substring
formatted_html = markdownify.markdownify(x, bullets=['+', '++', '+++'], strip=["ul"])

Result:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana

You can use recursion:

import bs4
from bs4 import BeautifulSoup as soup
s = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""
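def indent(d, c=0):
    # Join this tag's own non-blank text nodes (direct children only)
    if (text := ''.join(i for i in d.contents if isinstance(i, bs4.NavigableString) and i.strip())):
        yield f'{"+" * c} {text}'
    # Recurse into child tags; each level down adds one '+'
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from indent(i, c + 1)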

print('\n'.join(indent(soup(s, 'html.parser').ul)))

Output:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
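
For comparison, the nesting level can also be obtained without recursion by asking BeautifulSoup how many `<ul>` parents each `<li>` has. The sketch below is not taken from any of the answers above. The helper name `plus_bullets` is hypothetical, and it assumes every `<li>` contains only text (if a nested list sat inside an `<li>`, `get_text` would pull in the nested items too).

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""

def plus_bullets(markup):
    """Return one '+'-prefixed line per <li>, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for li in soup.find_all("li"):
        # One '+' for every <ul> that encloses this <li>
        depth = len(li.find_parents("ul"))
        lines.append("{} {}".format("+" * depth, li.get_text(strip=True)))
    return lines

print("\n".join(plus_bullets(html)))
```

Unlike the loop in the question, this leaves the parsed tree untouched and builds the output lines separately, so replacing a tag never interferes with the iteration.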