How to Extract Multiple H2 Tags Using BeautifulSoup

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
#print(articles)

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  article = {
    'H2_Heading': h2_headings,
  }

  print('Added article:', article)
  articlelist.append(article)

df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

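The webpage used in the script has multiple H2 heading tags that I want to scrape.

I am looking for a simple way to scrape the text of all the H2 headings, as shown below:

ANGRY BIRDS 2, ANGRY BIRDS DREAM BLAST, ANGRY BIRDS FRIENDS, ANGRY BIRDS MATCH, ANGRY BIRDS BLAST, ANGRY BIRDS POP

Issue

When I use the syntax h2_headings = item.find('h2').text, it extracts the first h2 heading text as expected.

However, I need to capture all instances of the H2 tag. When I use h2_headings = item.find_all('h2'), it returns the following result: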

{'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]}

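Amending the statement to h2_headings = item.find_all('h2').text.strip() returns the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Any help would be greatly appreciated.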

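Following this answer: How to remove h2 tag from html data using beautifulsoup4?

find_all() returns a list of tags rather than a single tag, so iterate over the result and read the text of each tag. Hope this helps.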

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  for h in h2_headings:
    articlelist.append(h.string)
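
One caveat with the loop above: .string returns None when a heading contains nested markup with more than one child (for example an <em> inside the <h2>), whereas get_text() concatenates all the text inside the tag. Below is a minimal sketch of the same loop run against an inline snippet. SAMPLE_HTML and extract_h2_texts are hypothetical names made up for illustration, the markup is not copied from the real page, and the built-in html.parser is used here only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <p>Some body text</p>
  <h2> Angry Birds Friends </h2>
  <h2>Angry Birds <em>POP</em></h2>
</div>
"""


def extract_h2_texts(html):
    """Return the stripped text of every <h2> inside the post containers."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for container in soup.find_all("div", class_="post-body__container"):
        # find_all() returns a list-like ResultSet, so loop over it
        for h in container.find_all("h2"):
            # get_text() also works when the heading holds nested tags
            texts.append(h.get_text().strip())
    return texts


headings = extract_h2_texts(SAMPLE_HTML)
print(headings)
```

If the headings on the real page are plain text only, h.string and h.get_text() should give the same result, so the choice only matters for headings with inner tags.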

You can do it as follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')


for item in articles:
    # Join the text of every <h2> in this container into one comma-separated string
    h2 = ', '.join(x.get_text() for x in item.find_all('h2'))
    print(h2)


#   print('Added article:', article)
#   articlelist.append(article)

# df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

Output:

Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
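
If the goal is still the CSV from the question, one option is to store one row per heading instead of a joined string. This is only a sketch under assumptions: SAMPLE_HTML is hypothetical markup rather than the real page, html.parser replaces lxml so nothing extra needs installing, and the CSV is written to an in-memory buffer instead of articlelist.csv.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <h2>Angry Birds Dream Blast</h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Build one dict per heading so each heading becomes its own DataFrame row
rows = [
    {"H2_Heading": h.get_text().strip()}
    for item in soup.find_all("div", class_="post-body__container")
    for h in item.find_all("h2")
]

df = pd.DataFrame(rows)

# Write to an in-memory buffer; pass a filename instead to save to disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Replacing the buffer with a path such as 'articlelist.csv' in to_csv() would write the file that the commented-out lines in the question were aiming for.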