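Python html parsing of div data using bs4
I want to remove the header and footer from an HTML page. I found that the header and footer appear as the last two lines of each div. Can anyone tell me how to extract all the data from a div except the last two lines, given HTML like the following: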
<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Third line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>line required 3
</p>
<p></p>
<p>line required 4
</p>
<p>line required 5
</p>
<p>line required 6
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>
My existing code is as follows:
# file_content holds the HTML text read from the file
soup = BeautifulSoup(file_content, 'html.parser')
for num, page in enumerate(soup.select('.page'), 1):
    content = page.get_text(strip=True, separator=' ').replace("\n", " ")
Answer:
# import packages
from bs4 import BeautifulSoup

with open('test.html', 'r') as f:
    file_content = f.read()

soup = BeautifulSoup(file_content, 'html.parser')
for page in soup.find_all("div", class_="page"):
    # remove the third-from-last child, then the last child
    page.contents[-3].extract()
    page.contents[-1].extract()
print(soup.prettify())
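This seems to achieve the desired result.
Notes:
- test.html is your HTML sample
- I had to remove the children at indices -1 and -3, which is probably related to the odd HTML you have (<p>Line 2 not required is never closed, and the <p /> tag does not look like a good idea: Should I use the <p /> tag in markup?). Note that page.contents also holds the whitespace text nodes between tags, so these indices are tied to this exact markup.
Regards,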
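Because the index-based removal depends on how the parser treats whitespace and the unclosed tag, it can help to work on the text nodes instead of on child positions. The following is only a minimal sketch, assuming each header/footer line is a single piece of text inside the div. The helper name `required_lines` and the shortened sample are hypothetical, not part of the original post.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample above (assumption: same shape,
# including the unclosed <p> and the <p /> tags in the second div).
html_str = """<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>"""


def required_lines(html):
    """Return, per div.page, its non-empty text lines minus the last two."""
    soup = BeautifulSoup(html, "html.parser")
    pages = []
    for page in soup.find_all("div", class_="page"):
        # stripped_strings skips whitespace-only text nodes, so empty
        # <p> tags and newlines between tags do not shift the positions.
        lines = list(page.stripped_strings)
        pages.append(lines[:-2])
    return pages


for lines in required_lines(html_str):
    print(lines)
```

Slicing the list of strings leaves the parsed tree untouched, so this fits best when only the text is needed, as in the `get_text` call from the question.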
Updated answer:
from bs4 import BeautifulSoup
html_str = """<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Third line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>line required 3
</p>
<p></p>
<p>line required 4
</p>
<p>line required 5
</p>
<p>line required 6
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>"""
# Load the HTML string into a bs4 object
soup = BeautifulSoup(html_str, 'lxml')
# Strip empty tags (this also removes the empty <p> tags), keeping <br>
for tag in soup.find_all(lambda tag: not tag.contents and tag.name != 'br'):
    tag.decompose()
# Collect every element with class "page"
items = soup.find_all(class_='page')
final_html = ''
# Remove the last two <p> tags from every div (as requested)
for item in items:
    paragraphs = item.find_all('p')
    last_item = str(paragraphs[-1])
    second_last_item = str(paragraphs[-2])
    current_item = str(item)
    current_item = current_item.replace(last_item, '')
    current_item = current_item.replace(second_last_item, '')
    final_html = final_html + current_item
# Re-parse the trimmed HTML with an explicit parser and print its text
final_soup = BeautifulSoup(final_html, 'lxml')
final_str = final_soup.text
print(final_str)
Output:
print(final_str)
--------------------------------
First line required
Second line required
Third line required
line required 1
line required 2
line required 3
line required 4
line required 5
line required 6
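One thing to keep in mind with the string-based removal is that `str.replace` replaces every occurrence, so a required paragraph whose markup is identical to a footer line would be removed as well. A possible variant, shown here only as a sketch, drops the last two non-empty `<p>` tags directly from the parsed tree. It assumes `html.parser` instead of `lxml`; the helper name `page_texts` and the shortened sample are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample above (assumption: same shape).
html_str = """<div class="page"><p />
<p></p>
<p>First line required
</p>
<p>Second line required
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
</p>
<p></p>
</div>
<div class="page"><p />
<p>line required 1
</p>
<p></p>
<p>line required 2
</p>
<p>Line 1 not required
</p>
<p>Line 2 not required
<p />
</div>"""


def page_texts(html):
    """Return the text of each div.page without its last two non-empty <p> tags."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for page in soup.find_all("div", class_="page"):
        # Keep only paragraphs that contain visible text.
        filled = [p for p in page.find_all("p") if p.get_text(strip=True)]
        # Remove the last two of them from the tree.
        for p in filled[-2:]:
            p.decompose()
        texts.append(page.get_text(separator=" ", strip=True))
    return texts


for text in page_texts(html_str):
    print(text)
```

Since the tags are removed by position in the tree, earlier paragraphs with the same text are left alone.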