How to find all comments with Beautiful Soup

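This question was asked four years ago, but the answers are now outdated for BS4.

I want to delete all comments in my HTML file using Beautiful Soup. Since BS4 makes each comment a special type of navigable string, I thought this code would work: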

for comments in soup.find_all('comment'):
     comments.decompose()

That didn't work. How do I find all comments using BS4?

I needed to do two things:

First, import Comment along with BeautifulSoup:

from bs4 import BeautifulSoup, Comment

Second, here is the code that extracts the comments:

# find_all() is the BS4 name for findAll(); string= replaces the older text= argument
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
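
To make the two steps above easier to try out, here is a minimal sketch that wraps them in a helper. The function name strip_comments and the sample markup are made up for illustration, and it assumes the built-in 'html.parser' backend.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html: str) -> str:
    """Return the markup with every HTML comment removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so filter text nodes by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()  # detach the comment node from the tree
    return str(soup)


sample = "<p>Hello<!-- hidden note --> world</p>"  # illustrative input
cleaned = strip_comments(sample)
print(cleaned)
```

Returning str(soup) keeps the helper independent of how the caller wants to use the result; returning the soup object itself would work just as well.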

You can pass a function to find_all() so that it checks whether each string is a comment.

For example, given the HTML below:

<body>
   <!-- Branding and main navigation -->
   <div class="Branding">The Science &amp; Safety Behind Your Favorite Products</div>
   <div class="l-branding">
      <p>Just a brand</p>
   </div>
   <!-- test comment here -->
   <div class="block_content">
      <a href="https://www.google.com">Google</a>
   </div>
</body>

Code:

from bs4 import BeautifulSoup as BS
from bs4 import Comment
# .... (html is assumed to hold the markup shown above)
soup = BS(html, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    print(c)
    print("===========")
    c.extract()

The output will be:

 Branding and main navigation 
===========
 test comment here 
===========
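
As a quick check of what extract() leaves behind, the sketch below (with made-up sample markup, again assuming 'html.parser') shows that the extracted Comment objects stay usable in the list while a second search of the tree finds nothing.

```python
from bs4 import BeautifulSoup, Comment

html = "<body><!-- first --><p>Just a brand</p><!-- second --></body>"  # illustrative
soup = BeautifulSoup(html, "html.parser")


def is_comment(node):
    """True for text nodes that are HTML comments."""
    return isinstance(node, Comment)


comments = soup.find_all(string=is_comment)
for c in comments:
    c.extract()  # removes the node from the tree but keeps the object alive

# The extracted comments are still ordinary strings; strip() trims the padding.
texts = [c.strip() for c in comments]

# Searching again confirms that the tree no longer contains comments.
remaining = soup.find_all(string=is_comment)

print(texts)
print(remaining)
print(soup)
```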

By the way, I think the reason find_all('Comment') doesn't work is this (from the BeautifulSoup documentation):

Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match.
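
The difference between the two kinds of filter can be seen in a small sketch. The <comment> element here is an invented tag, included only to show what a name filter actually matches; the parser choice is likewise an assumption.

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup: one real HTML comment and one made-up <comment> tag.
html = "<div><!-- a note --><comment>not a real comment</comment></div>"
soup = BeautifulSoup(html, "html.parser")

# A positional string is a tag-name filter, so this matches the <comment> tag only.
by_name = soup.find_all("comment")

# A function passed as string= is tested against text nodes, including Comment objects.
by_type = soup.find_all(string=lambda s: isinstance(s, Comment))

print(by_name)
print(by_type)
```

So a name filter never sees the comment at all, which is why the loop in the question ran zero times instead of raising an error.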