How can I scrape the comment out of an HTML tag

Question

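I'm trying to scrape the comment from a particular HTML tag, but I'm running into a problem. I can scrape all of the text under the tag without any trouble, but not the comment. Can anyone help?

Here is my code: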

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

page=urlopen('https://catalog.data.gov/dataset')
soup=BeautifulSoup(page,'lxml')

dataset_number=soup.select('div .new-results')
print(dataset_number)

I want to extract the HTML comment from the data returned by the code above.

Answer 1

Try this:

from bs4 import BeautifulSoup,Comment
from urllib.request import urlopen
page=urlopen('https://catalog.data.gov/dataset')
soup=BeautifulSoup(page,'lxml')
dataset_number=soup.select('div .new-results')[0]
# Calling a tag is shorthand for find_all(); keep only the Comment nodes.
for com in dataset_number(string=lambda s: isinstance(s, Comment)):
    print(com)
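
The same idea can be tried without hitting the live page. BeautifulSoup parses `<!-- ... -->` into `Comment` nodes, a subclass of `NavigableString`, so filtering on that type picks out the comments that plain text extraction leaves behind. The following is only a minimal sketch: `SAMPLE_HTML` and the helper `extract_comments` are hypothetical names made up for illustration, and it uses the built-in `html.parser` instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div class="wrapper">
  <div class="new-results">
    <!-- first comment -->
    some visible text
    <!-- second comment -->
  </div>
</div>
"""


def extract_comments(html, selector):
    """Return the stripped comments under the first element matching selector."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    if tag is None:
        # Nothing matched, so there is nothing to search.
        return []
    # Keep only the string nodes that are Comment instances.
    found = tag.find_all(string=lambda s: isinstance(s, Comment))
    return [c.strip() for c in found]


comments = extract_comments(SAMPLE_HTML, "div .new-results")
print(comments)
```

Using `select_one` and checking for `None` avoids the `IndexError` that `select(...)[0]` raises when the selector matches nothing.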

Answer 2

I had a similar case, and I handled it with a handy regular expression.

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

page=urlopen('https://catalog.data.gov/dataset')
soup=BeautifulSoup(page,'lxml')

dataset_number=soup.select('div .new-results')
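# Non-greedy match with DOTALL so adjacent and multi-line comments are captured separately.
result = re.findall(r'<!--(.*?)-->', str(dataset_number), re.DOTALL)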
print(result)
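
One thing to watch with the regex route is how the pattern handles several comments, or a comment that spans lines. The sketch below compares a greedy pattern with a non-greedy one compiled with `re.DOTALL`; `SAMPLE_HTML` is a hypothetical string invented for the comparison, not the markup of the real page.

```python
import re

# Hypothetical markup: two comments on one line, then a comment spanning two lines.
SAMPLE_HTML = "<div><!-- a --><b>x</b><!-- b -->\n<!-- c1\nc2 --></div>"

# Greedy, no DOTALL: merges the two comments on the first line and
# misses the multi-line comment because '.' does not match a newline.
greedy = re.findall(r"<!--(.*)-->", SAMPLE_HTML)

# Non-greedy with DOTALL: stops at the first '-->' and lets '.' cross newlines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
lazy = [m.strip() for m in COMMENT_RE.findall(SAMPLE_HTML)]

print(greedy)
print(lazy)
```

A regex is fine for a quick extraction from markup you control, but a parser-based approach is generally more robust against unusual HTML.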

Tags: python, urllib, beautifulsoup, css-selectors, web-scraping