Scraping all h1 tags contents with Beautiful Soup

Question

I'm trying to scrape some review data with Beautiful Soup, but it only lets me grab a single element:

BASE_URL = "http://consequenceofsound.net/category/reviews/album-reviews/"
html = urlopen(BASE_URL + section_url).read()
soup = BeautifulSoup(html, "lxml")
meta = soup.find("div", {"class": "content"}).h1
wordage = [s.contents for s in meta]

This gets me a single review title from the page. However, when I change find to find_all, h1 is no longer recognized on that line, so I ended up with code like this:

meta = soup.find("div", {"class": "content"})
wordage = [s.h1 for s in meta]

I can't find a way to isolate the contents.

Answer 1

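# find_all returns every matching div, not just the first one
meta = soup.find_all("div", {"class": "content"})

# keep only the divs that actually contain an h1
wordage = [s.h1 for s in meta if s.h1 not in ([], None)]
# skip any h1 that has no link inside it
link = [s.a['href'] for s in wordage if s.a is not None]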

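Note the addition of the 'not in' check. s.h1 is None for any div that has no h1 (and an empty list can apparently turn up as well), so filtering those out before reading the link is an important safeguard.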
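
If you want to try this without hitting the site, here is a minimal sketch of the same idea run against an inline HTML string. The SAMPLE_HTML markup and the extract_titles_and_links helper are made up for illustration and are not taken from the real page, and it uses the built-in "html.parser" backend instead of lxml so nothing extra needs to be installed beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real review page.
SAMPLE_HTML = """
<div class="content"><h1><a href="/reviews/album-one/">Album One</a></h1></div>
<div class="content"><p>No heading in this block.</p></div>
<div class="content"><h1><a href="/reviews/album-two/">Album Two</a></h1></div>
"""


def extract_titles_and_links(html):
    """Return (titles, links) for every h1 found inside a div.content."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": "content"})
    # div.h1 is None when the div has no h1, so filter those out.
    headings = [div.h1 for div in divs if div.h1 is not None]
    titles = [h.get_text(strip=True) for h in headings]
    # Guard against an h1 that has no link inside it.
    links = [h.a["href"] for h in headings if h.a is not None]
    return titles, links


titles, links = extract_titles_and_links(SAMPLE_HTML)
print(titles)  # ['Album One', 'Album Two']
print(links)   # ['/reviews/album-one/', '/reviews/album-two/']
```

The second div has no h1, which is exactly the case the filter is there for. Without it, the list would contain a None and the href lookup would fail.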

Tags: python, nlp, screen-scraping, beautifulsoup, web-scraping