Parsing nested HTML lists using Python
My HTML contains nested lists like this:
<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>
I need to parse them so that they look like this:
+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
I tried using BeautifulSoup, but I am stuck on how to account for the nesting in my code.
Example, where x contains the HTML listed above:
import bs4
soup = bs4.BeautifulSoup(x, "html.parser")
for ul in soup.find_all("ul"):
    for li in ul.find_all("li"):
        li.replace_with("+ {}\n".format(li.text))
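This is a bit of a hack, but you can do it with lxml instead:
import lxml.html as lh
from lxml import etree

uls = """[your html above]"""
doc = lh.fromstring(uls)
tree = etree.ElementTree(doc)
for e in doc.iter('li'):
    # Each enclosing <ul> contributes one 'ul' step to the element's XPath.
    path = tree.getpath(e)
    print('+' * path.count('ul'), e.text)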
Output:
+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
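The path-counting trick works because the nesting level of an item is simply the number of `ul` ancestors it has. As a rough sketch of that same idea without lxml, the snippet below uses the standard library's `xml.etree.ElementTree` and counts ancestors explicitly. It assumes the markup is well-formed enough to parse as XML, which is true for the sample here but often not for real-world HTML. The names `HTML`, `parent_of`, and `ul_depth` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; assumed to be well-formed XML.
HTML = """<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>"""

root = ET.fromstring(HTML)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ul_depth(element):
    """Count how many <ul> elements enclose the given element."""
    depth = 0
    node = parent_of.get(element)
    while node is not None:
        if node.tag == "ul":
            depth += 1
        node = parent_of.get(node)
    return depth


# root.iter("li") visits the items in document order.
lines = [f"{'+' * ul_depth(li)} {(li.text or '').strip()}" for li in root.iter("li")]
print("\n".join(lines))
```

Counting ancestors this way avoids relying on the string form of the XPath, at the cost of building the parent map up front.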
I think it would be easier to convert the HTML string to Markdown with custom bullets. This can be done with markdownify:
import markdownify
formatted_html = markdownify.markdownify(x, bullets=['+', '++', '+++'], strip=["ul"])
Result:
+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
You can use recursion:
import bs4
from bs4 import BeautifulSoup as soup

s = """
<ul>
<li>Apple</li>
<li>Pear</li>
<ul>
<li>Cherry</li>
<li>Orange</li>
<ul>
<li>Pineapple</li>
</ul>
</ul>
<li>Banana</li>
</ul>
"""

def indent(d, c=0):
    # Text sitting directly inside this tag, ignoring whitespace-only strings.
    if (text := ''.join(i for i in d.contents if isinstance(i, bs4.NavigableString) and i.strip())):
        yield f'{"+" * c} {text}'
    # Recurse into child tags, one level deeper for each tag.
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from indent(i, c + 1)

print('\n'.join(indent(soup(s, 'html.parser').ul)))
Output:
+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
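For comparison, the depth counter that the recursive version passes along can also be kept in an event-driven parser, with no third-party packages. This is only a sketch built on the standard library's `html.parser.HTMLParser`; the class name `BulletParser` and the helper `to_bullets` are hypothetical, and it assumes each `<li>` holds plain text with no inline tags.

```python
from html.parser import HTMLParser


class BulletParser(HTMLParser):
    """Collect '+'-prefixed lines, one per <li>, based on <ul> nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # number of <ul> elements currently open
        self.in_li = False   # True while inside an <li>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        text = data.strip()
        # Only text inside an <li> becomes a bullet line.
        if self.in_li and text:
            self.lines.append(f"{'+' * self.depth} {text}")


def to_bullets(html):
    """Return the bullet lines for an HTML fragment containing nested lists."""
    parser = BulletParser()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = "<ul><li>Apple</li><ul><li>Cherry</li><ul><li>Pineapple</li></ul></ul><li>Banana</li></ul>"
print("\n".join(to_bullets(sample)))
```

Because the parser reacts to tags as they stream past, it never builds a tree, which keeps it simple but also means it cannot look ahead or repair badly nested markup.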