How to Extract Multiple H2 Tags Using BeautifulSoup

Question

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
#print(articles)

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  article = {
    'H2_Heading': h2_headings,
  }

  print('Added article:', article)
  articlelist.append(article)

df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

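The web page used in the script contains several H2 heading tags that I want to scrape.

I am looking for a simple way to scrape the text of all the H2 headings, like this:

ANGRY BIRDS 2, ANGRY BIRDS DREAM BLAST, ANGRY BIRDS FRIENDS, ANGRY BIRDS MATCH, ANGRY BIRDS BLAST, ANGRY BIRDS POP

Problem

When I use the syntax h2_headings = item.find('h2').text, it returns exactly the first h2 heading's text, as expected.

However, I need to capture every instance of the H2 tag. When I use h2_headings = item.find_all('h2'), it returns the following: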

{'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]}

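Changing the statement to h2_headings = item.find_all('h2').text.strip() returns the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?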
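
The behaviour can be reproduced offline with a small inline snippet. This is only a minimal sketch: the HTML string stands in for the live page, and html.parser is used instead of lxml so that no extra parser is needed (both are assumptions for illustration). It shows that find() returns a single tag, while find_all() returns a list-like ResultSet whose items each have their own .text.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (assumption for illustration)
html = "<div><h2>Angry Birds 2</h2><h2>Angry Birds Friends</h2></div>"
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag, so .text works directly
first = soup.find("h2").text

# find_all() returns a ResultSet (list-like); it has no .text of its own
results = soup.find_all("h2")
try:
    results.text
    error = None
except AttributeError as exc:
    error = exc

# Each element of the ResultSet is a Tag, so read .text per element
texts = [h.text.strip() for h in results]
print(first)
print(type(error).__name__)
print(texts)
```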

Any help would be greatly appreciated.

Answer 1

Following this answer: How to remove h2 tag from html data using beautifulsoup4?

Hope this helps.

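# `articles` and `articlelist` are defined in the question's script
for item in articles:
  # find_all() returns a list-like ResultSet, so iterate over it
  h2_headings = item.find_all('h2')

  for h in h2_headings:
    # .string gives the text of a tag whose only child is a text node
    articlelist.append(h.string)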
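
One caveat with h.string: it is None when a heading has more than one child node (for example, text next to a nested tag), whereas get_text() concatenates all nested text. A minimal sketch on made-up markup (the HTML below is an assumption for illustration, not taken from the page, and html.parser is used so no extra parser is required):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading mixes text with a nested <em> tag
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <h2>Angry Birds <em>POP</em></h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h2")

# .string is None for the heading with two child nodes
strings = [h.string for h in headings]

# get_text() joins all nested text; strip() trims surrounding whitespace
texts = [h.get_text().strip() for h in headings]

print(strings)
print(texts)
```

If the headings on the target page are plain text only, both approaches give the same result.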

Answer 2

You can do it as follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')


for item in articles:
    # Collect the text of every <h2> in this container and join with commas
    h2 = ', '.join(x.get_text() for x in item.find_all('h2'))
    print(h2)

Output:

Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
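
If the goal is still the DataFrame and CSV from the question, one option is to store one row per heading instead of one joined string. The sketch below is only an illustration under stated assumptions: it runs on an inline HTML snippet rather than the live page, uses html.parser instead of lxml, and keeps the H2_Heading column name from the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (assumption for illustration)
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <h2>Angry Birds Dream Blast</h2>
  <h2>Angry Birds Friends</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

articlelist = []
for item in soup.find_all("div", class_="post-body__container"):
    for h2 in item.find_all("h2"):
        # One dictionary per heading gives one DataFrame row per heading
        articlelist.append({"H2_Heading": h2.get_text().strip()})

df = pd.DataFrame(articlelist)
print(df)
# df.to_csv('articlelist.csv', index=False)  # uncomment to write the CSV
```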


Tags: python, beautifulsoup, h2, findall, web-scraping
