How to find all comments with Beautiful Soup
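This question was asked four years ago, but the answers are now out of date for BS4.
I want to delete all comments in my HTML file using Beautiful Soup. Since BS4 makes each comment a special type of navigable string, I thought this code would work: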
for comments in soup.find_all('comment'):
    comments.decompose()
That didn't work. How do I find all comments using BS4?
I needed to do two things:
First, import Comment along with BeautifulSoup:
from bs4 import BeautifulSoup, Comment
Second, here is the code that extracts the comments:
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
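The two steps above can be put together as a small self-contained sketch. The helper name strip_comments, the sample markup, and the choice of the 'html.parser' parser are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html: str) -> str:
    """Return the markup with every comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    # Comment is a subclass of NavigableString, so a string filter can match it.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()  # detach the comment node from the tree
    return str(soup)


sample = "<div><!-- hidden note --><p>Visible</p></div>"
cleaned = strip_comments(sample)
print(cleaned)
```

extract() is used rather than decompose() because it only detaches the node, which is all that is needed to drop a comment from the output.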
You can pass a function to find_all() so that it checks whether each string is a comment.
For example, given the HTML below:
<body>
<!-- Branding and main navigation -->
<div class="Branding">The Science & Safety Behind Your Favorite Products</div>
<div class="l-branding">
<p>Just a brand</p>
</div>
<!-- test comment here -->
<div class="block_content">
<a href="https://www.google.com">Google</a>
</div>
</body>
Code:
from bs4 import BeautifulSoup as BS
from bs4 import Comment
....
soup = BS(html, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    print(c)
    print("===========")
    c.extract()
The output will be:
Branding and main navigation
===========
test comment here
===========
By the way, I think the reason find_all('Comment') doesn't work is (from the BeautifulSoup documentation):
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match.
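A short sketch can make the difference between the two filters concrete. The markup below is made up for illustration and deliberately includes a tag that happens to be named comment; the variable names by_name and by_type are likewise hypothetical.

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup: one real comment plus a tag literally named <comment>.
html = "<body><!-- a note --><comment>tag named comment</comment><p>text</p></body>"
soup = BeautifulSoup(html, "html.parser")

# A bare string argument is a tag-name filter, so it only sees <comment> tags.
by_name = soup.find_all("comment")

# A string= filter inspects text nodes, which include Comment objects.
by_type = soup.find_all(string=lambda s: isinstance(s, Comment))

print([tag.name for tag in by_name])
print([str(c) for c in by_type])
```

With this input the name filter should return only the tag, while the string filter should return only the comment node, which is consistent with the documentation passage quoted above.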