How can I transform multiple tables to an unordered list of items, where each table is a <li>?

I am trying to fix an HTML file. It contains multiple table entries, and I would like to convert them into "ul li" elements that hold the table content.

I have tried finding all "table" tags and replacing them with "li" (see the code below), but I cannot wrap the resulting list items in a "ul".

<p> Hello world!</p>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Second</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Third</p></td></tr></table>
<table><tr><td>&nbsp;</td><td">&bull;</td><td><p>Last</p></td></tr></table>
<p>Some paragraph</p>
<table>&nbsp;</td><td>&bull;</td><td><p>1st item of 2nd list</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>2nd item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>

Here is what I did:

def replaceBullets(soup):
    if soup.find('table'):
        for table in soup.findAll('table'):
            if isUnordered(table.text):
                replacement = soup.new_tag("li")
                replacement.string = table.p.text
                table.replace_with(replacement)

def isUnordered(line):
    if u'\u2022' in line and u'\xa0' in line:
        return True
    return False

I would like to get:

<p>Hello world!</p>
<ul><li>First bullet point text</li>
<li>Second</li>
<li>Third</li>
<li>Last</li></ul>
<p>Some paragraph</p>
<ul><li>1st item of 2nd list</li>
<li>2nd item of 2nd list</li></ul>
<p>Another paragraph</p>

But I cannot find a way to insert the "ul" tags.

Wow, this was a tedious task, but I finally managed to do it. I used find_all with a filter function to find the <p> elements inside the tables.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
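As a minimal illustration of the function filter described at that link (the HTML snippet and the name is_p_inside_table are made up for this example), find_all accepts a function that takes a tag and returns True for the tags to keep:

```python
from bs4 import BeautifulSoup

# Hypothetical input: one <p> outside a table and one inside.
html = "<div><p>outside</p><table><tr><td><p>inside</p></td></tr></table></div>"
soup = BeautifulSoup(html, "html.parser")

def is_p_inside_table(tag):
    # find_all calls this once per tag and keeps the tags for which it returns True
    return tag.name == "p" and tag.find_parent("table") is not None

texts = [p.get_text() for p in soup.find_all(is_p_inside_table)]
print(texts)
```

Only the <p> that has a <table> ancestor is matched; the standalone <p> is skipped.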

Note that I have fixed the malformed parts of the HTML you posted.

from bs4 import BeautifulSoup, Tag

if __name__ == "__main__":

    html = '''
    <p>Hello world!</p>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Second</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Third</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Last</p></td></tr></table>
<p>Some paragraph</p>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>1st item of 2nd list</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>2nd item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>
    '''

    soup = BeautifulSoup(html, 'html.parser')

    # find all <p>s under a table and replace table with the <p> element
    def p_under_table_extractor(el: Tag) -> bool:
        # match a <p> only when it has a <table> ancestor
        return el.name == 'p' and el.find_parent('table') is not None

    for p in soup.find_all(p_under_table_extractor):
        table_parent = p.find_parent('table')
        p.name = 'li'
        table_parent.replace_with(p)

    # the only <p>s left are the top-level ones (the others are now <li>s)
    for p in soup.find_all('p'):
        # find all succeeding <li>s
        li_els = []
        for el in p.find_all_next():
            if el.name != 'li':
                break
            else:
                li_els.append(el)
        # put those <li>s inside a <ul>
        if li_els:
            ul = soup.new_tag('ul')
            for li in li_els:
                ul.append(li)
            # and put <ul> after the <p>
            p.insert_after(ul)

    print(soup.prettify())

This prints (whitespace tidied for readability):

<p>Hello world!</p>
<ul>
    <li>First bullet point text</li>
    <li>Second</li>
    <li>Third</li>
    <li>Last</li>
</ul>
<p>Some paragraph</p>
<ul>
    <li>1st item of 2nd list</li>
    <li>2nd item of 2nd list</li>
</ul>
<p>Another paragraph</p>
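A possible variation, sketched under assumptions rather than taken from the answer above: instead of relying on a preceding <p>, each bullet table can be compared with its nearest non-whitespace previous sibling, and a new <ul> is started only when that sibling is not the list currently being built. The helper names (is_bullet_table, previous_significant_sibling, tables_to_lists) are hypothetical, and the bullet test mirrors isUnordered from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def is_bullet_table(tag):
    # Same idea as isUnordered in the question: a table containing a bullet,
    # a non-breaking space, and a <p> holding the item text.
    if tag.name != "table" or tag.p is None:
        return False
    text = tag.get_text()
    return "\u2022" in text and "\xa0" in text

def previous_significant_sibling(node):
    # Walk backwards over siblings, ignoring whitespace-only text nodes.
    for sib in node.previous_siblings:
        if isinstance(sib, NavigableString) and not sib.strip():
            continue
        return sib
    return None

def tables_to_lists(soup):
    current_ul = None  # the <ul> most recently created by this function
    for table in soup.find_all(is_bullet_table):
        li = soup.new_tag("li")
        li.string = table.p.get_text()
        if current_ul is not None and previous_significant_sibling(table) is current_ul:
            # The table directly follows the list being built: extend that list.
            table.extract()
            current_ul.append(li)
        else:
            # Otherwise start a new list where the table was.
            current_ul = soup.new_tag("ul")
            current_ul.append(li)
            table.replace_with(current_ul)
    return soup

html = """<p>Hello world!</p>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>Second</p></td></tr></table>
<p>Some paragraph</p>
<table><tr><td>&nbsp;</td><td>&bull;</td><td><p>1st item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>"""

soup = tables_to_lists(BeautifulSoup(html, "html.parser"))
lists = [[li.get_text() for li in ul.find_all("li")] for ul in soup.find_all("ul")]
print(lists)
```

This sketch leaves the original line breaks between the removed tables in place, so the serialized HTML may contain extra blank lines; the element structure is what matters here.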