Keep one nested div in an HTML document and clear all the others
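I'm trying to strip the clutter out of a noisy, deeply nested HTML document. I want to keep the structure of the page and only clear the contents of the surrounding divs.
The structure looks like this: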
<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
....several levels deeper...
</div>
..stuff..
</div>
...stuff..
</div>
I want to delete everything outside the div I'm interested in, but keep everything inside that div. Here is the code I tried:
for div in soup.find_all("div"):
    if div.has_attr('class'):
        if div['class'] == "my_class_of_interest":
            continue
    div.clear()
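But this also wipes out my div of interest, I suspect because I'm clearing its parent and the clear goes all the way down. Is there a way to clear a div's text without removing the divs nested inside it? Or is there a better approach?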
I hope I understood your question. This script clears all the strings around the tag of interest:
from bs4 import BeautifulSoup, Tag

txt = '''
<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
....several levels deeper...
</div>
..stuff..
</div>
...stuff..
</div>'''

soup = BeautifulSoup(txt, 'html.parser')

# print soup before clearing
print(soup)

def clear(tag):
    # iterate over a copy, because replace_with() modifies tag.contents
    for c in list(tag.contents):
        if isinstance(c, Tag) and c.name == 'div' and 'my_class_of_interest' in c.get('class', []):
            # leave the div of interest and everything inside it untouched
            continue
        elif isinstance(c, Tag):
            # any other tag: keep the tag, recurse into its children
            clear(c)
        else:
            # plain string outside the div of interest: blank it out
            c.replace_with('')

clear(soup.select_one('div.a'))

print('-' * 80)

# print soup after clearing:
print(soup.prettify())
Prints:
<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
....several levels deeper...
</div>
..stuff..
</div>
...stuff..
</div>
--------------------------------------------------------------------------------
<div class="a">
 <div>
  <div class="my_class_of_interest">
   ....several levels deeper...
  </div>
 </div>
</div>
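As for why the loop in the question did not work: in bs4 the `class` attribute is multi-valued, so `div['class']` is a list such as `['my_class_of_interest']` and never equals a plain string, and `clear()` removes every child of a tag, nested tags included, so clearing the outermost div already discards the div of interest. A non-recursive variant that avoids both problems could look roughly like the sketch below. It is only a sketch: the `<p>` and `<b>` tags inside the div of interest are added here for illustration (they are not in the question's markup), and the variable names are made up.

```python
from bs4 import BeautifulSoup

# Sample markup; the <p>/<b> nesting inside the target is an assumption
# added to show that nested content survives.
html = '''<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
<p>....several <b>levels</b> deeper...</p>
</div>
..stuff..
</div>
...stuff..
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list, so removing strings does not disturb the loop.
for s in soup.find_all(string=True):
    # Keep a string only if the div of interest is one of its ancestors.
    if s.find_parent('div', class_='my_class_of_interest') is None:
        s.extract()

print(soup)
```

Only strings are removed, so every tag stays in place and the page structure is preserved. Matching with `class_=` also works when the div carries additional classes.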
Another option, using lxml:
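import lxml.html as lh

interest = """your html above"""
doc = lh.fromstring(interest)
retain = ''
for d in doc.xpath('//*'):
    # remember the direct text of the div of interest before wiping it
    if d.get('class') == "my_class_of_interest":
        retain += d.text or ''
    # blank the text inside the element and the text following it
    d.text = ""
    d.tail = ""
# put the saved text back into the div of interest
for target in doc.xpath('//div[@class="my_class_of_interest"]'):
    target.text = retain
print(lh.tostring(doc).decode())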
Output:
<div class="a"><div><div class="my_class_of_interest">
....several levels deeper...
</div></div></div>
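One caveat with the lxml snippet: it only restores the direct text of the div of interest, so text inside tags nested in that div would still be blanked. If the div really contains further markup, something along these lines might be closer to what is wanted. This is a sketch under assumptions: the `<p>` and `<b>` tags in the sample are added for illustration, the names `targets` and `inside` are made up, and the XPath matches only when the class attribute is exactly `my_class_of_interest`.

```python
import lxml.html as lh

# Sample markup; the <p>/<b> nesting inside the target is an assumption.
html = '''<div class="a">
...stuff...
<div>
...stuff....
<div class="my_class_of_interest">
<p>....several <b>levels</b> deeper...</p>
</div>
..stuff..
</div>
...stuff..
</div>'''

doc = lh.fromstring(html)

targets = doc.xpath('//div[@class="my_class_of_interest"]')
# Everything nested inside a target is left completely alone.
inside = {el for t in targets for el in t.iterdescendants()}

for el in doc.iter('*'):
    if el in inside:
        continue
    if el not in targets:
        el.text = None  # text between the opening tag and the first child
    el.tail = None      # text after the closing tag lies outside the element

print(lh.tostring(doc).decode())
```

The tail of the div of interest is still cleared, because in lxml's model the tail is the text that follows an element's closing tag and therefore sits outside the div.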