How can I transform multiple tables into an unordered list, where each table becomes an <li>?
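I'm trying to fix an HTML file. It has multiple table entries, and I want to convert each one into a "ul li" holding the table's content.
I tried finding all the "table" tags and replacing them with "li" (see the code below), but I can't wrap the resulting list in "ul" tags.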
<p> Hello world!</p>
<table><tr><td> </td><td>•</td><td><p>First bullet point text</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Second</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Third</p></td></tr></table>
<table><tr><td> </td><td">•</td><td><p>Last</p></td></tr></table>
<p>Some paragraph</p>
<table> </td><td>•</td><td><p>1st item of 2nd list</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>2nd item of 2nd list</p></td></tr></table>
<p>Another paragraph</p>
This is what I did:
def replaceBullets(soup):
    if soup.find('table'):
        for table in soup.findAll('table'):
            if isUnordered(table.text):
                replacement = soup.new_tag("li")
                replacement.string = table.p.text
                table.replace_with(replacement)

def isUnordered(line):
    if u'\u2022' in line and u'\xa0' in line:
        return True
    return False
I want to get:
<p>Hello world!</p>
<ul><li>First bullet point text</li>
<li>Second</li>
<li>Third</li>
<li>Last</li></ul>
<p>Some paragraph</p>
<ul><li>1st item of 2nd list</li>
<li>2nd item of 2nd list</li></ul>
<p>Another paragraph</p>
But I can't find a way to insert the "ul" tags.
Wow, this was a tedious task, but I finally managed to do it. I used find_all with a filter function to find the <p> elements inside a table:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
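To make the filter-function idea concrete, here is a minimal sketch on a made-up two-element document. The name is_p_inside_table and the sample HTML are only illustrative; the one behaviour relied on is that find_all accepts a function and keeps every tag for which it returns a truthy value.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <p> outside a table, one inside.
html = "<p>outside</p><table><tr><td><p>inside</p></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

def is_p_inside_table(tag):
    # find_all calls this once per tag; a truthy result keeps the tag.
    return tag.name == "p" and tag.find_parent("table") is not None

matches = soup.find_all(is_p_inside_table)
texts = [p.get_text() for p in matches]
print(texts)  # ['inside']
```

Only the <p> nested in the table is returned; the top-level <p> is filtered out.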
Note that I've fixed the malformed parts of the HTML you posted.
from bs4 import BeautifulSoup, Tag

if __name__ == "__main__":
    html = '''
    <p>Hello world!</p>
    <table><tr><td> </td><td>•</td><td><p>First bullet point text</p></td></tr></table>
    <table><tr><td> </td><td>•</td><td><p>Second</p></td></tr></table>
    <table><tr><td> </td><td>•</td><td><p>Third</p></td></tr></table>
    <table><tr><td> </td><td>•</td><td><p>Last</p></td></tr></table>
    <p>Some paragraph</p>
    <table><tr><td> </td><td>•</td><td><p>1st item of 2nd list</p></td></tr></table>
    <table><tr><td> </td><td>•</td><td><p>2nd item of 2nd list</p></td></tr></table>
    <p>Another paragraph</p>
    '''
    soup = BeautifulSoup(html, 'html.parser')

    # find all <p>s under a table and replace table with the <p> element
    def p_under_table_extractor(el: Tag) -> bool:
        return el.name == 'p' and el.find_parent('table') is not None

    for p in soup.find_all(p_under_table_extractor):
        table_parent = p.find_parent('table')
        # rename the <p> to <li> and let it take the table's place
        p.name = 'li'
        table_parent.replace_with(p)

    # the <p>s from the tables are now <li>s, so the only <p>s left are the root <p>s
    for p in soup.find_all('p'):
        # find all succeeding <li>s, stopping at the first tag that is not an <li>
        li_els = []
        for el in p.find_all_next():
            if el.name != 'li':
                break
            li_els.append(el)

        # put those <li>s inside a <ul>
        if li_els:
            ul = soup.new_tag('ul')
            for li in li_els:
                ul.append(li)
            # and put <ul> after the <p>
            p.insert_after(ul)

    print(soup.prettify())
Prints:
<p>Hello world!</p>
<ul>
<li>First bullet point text</li>
<li>Second</li>
<li>Third</li>
<li>Last</li>
</ul>
<p>Some paragraph</p>
<ul>
<li>1st item of 2nd list</li>
<li>2nd item of 2nd list</li>
</ul>
<p>Another paragraph</p>
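The code above starts each list from a preceding <p>, so a run of tables that follows some other element would not be wrapped. As a possible variation, and not part of the original answer, the sketch below groups adjacent bullet tables by walking next_sibling instead. The names tables_to_lists and BULLET are hypothetical, and it assumes every bullet table contains a <p> and that a table counts as a bullet when its text contains "•" (the non-breaking-space check from the question is left out).

```python
from bs4 import BeautifulSoup, NavigableString, Tag

BULLET = "\u2022"  # assumption: a table is a bullet if its text contains this


def is_bullet_table(node):
    return isinstance(node, Tag) and node.name == "table" and BULLET in node.get_text()


def tables_to_lists(soup):
    """Wrap each run of adjacent bullet tables in a single <ul>."""
    for table in soup.find_all("table"):
        # Tables already moved into an earlier run have been detached.
        if table.parent is None or not is_bullet_table(table):
            continue
        run = [table]
        node = table.next_sibling
        while node is not None:
            if isinstance(node, NavigableString) and not node.strip():
                node = node.next_sibling  # skip whitespace between tables
            elif is_bullet_table(node):
                run.append(node)
                node = node.next_sibling
            else:
                break  # anything else ends the run
        ul = soup.new_tag("ul")
        table.insert_before(ul)
        for t in run:
            li = soup.new_tag("li")
            li.string = t.p.get_text(strip=True)  # assumes a <p> per table
            ul.append(li)
            t.extract()
    return soup


html = """<p>Hello world!</p>
<table><tr><td> </td><td>•</td><td><p>First</p></td></tr></table>
<table><tr><td> </td><td>•</td><td><p>Second</p></td></tr></table>
<p>Some paragraph</p>
<table><tr><td> </td><td>•</td><td><p>Only</p></td></tr></table>
"""
soup = tables_to_lists(BeautifulSoup(html, "html.parser"))
lists = [[li.get_text() for li in ul.find_all("li")] for ul in soup.find_all("ul")]
print(lists)
```

Because each <ul> is inserted where the first table of its run stood, the lists stay in place relative to the surrounding paragraphs.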