How to Extract Multiple H2 Tags Using BeautifulSoup
import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'
r = requests.get(url)
#print(r.status_code)
soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_='post-body__container')
#print(articles)
for item in articles:
    #h2_headings = item.find('h2').text
    h2_headings = item.find_all('h2')
    article = {
        'H2_Heading': h2_headings,
    }
    print('Added article:', article)
    articlelist.append(article)
df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')
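The web page used in the script has multiple H2 heading tags that I want to scrape.
I'm looking for a simple way to scrape the text of all the H2 headings, like this:
ANGRY BIRDS 2, ANGRY BIRDS DREAM BLAST, ANGRY BIRDS FRIENDS, ANGRY BIRDS MATCH, ANGRY BIRDS BLAST, ANGRY BIRDS POP
Problem
When I use h2_headings = item.find('h2').text, it returns the text of the first h2 heading, as expected.
However, I need to capture every instance of the H2 tag. When I use h2_headings = item.find_all('h2'), it returns the following: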
{'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]}
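Changing the statement to h2_headings = item.find_all('h2').text.strip() returns the following error:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Any help would be greatly appreciated.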
See this answer: How to remove h2 tag from html data using beautifulsoup4?
Hope this helps.
for item in articles:
    #h2_headings = item.find('h2').text
    h2_headings = item.find_all('h2')
    # find_all() returns a list-like ResultSet, so handle each tag in turn
    for h in h2_headings:
        articlelist.append(h.string)
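To illustrate why looping over the result works where .text on the whole result does not, here is a minimal sketch. It assumes a made-up HTML snippet in place of the real page and uses the built-in html.parser so that lxml is not required; the nested <em> tag is hypothetical and only there to show that .string returns None when a tag has more than one child, while get_text() still returns the full text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the blog post (an assumption, not the
# real page). The second heading contains a nested <em> tag on purpose.
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <h2>Angry Birds <em>Dream</em> Blast</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h2")  # a ResultSet (list-like), not a single Tag

# Accessing .text on the ResultSet itself raises AttributeError.
try:
    headings.text
    raised = False
except AttributeError:
    raised = True

# Per-tag access works. .string is None for a tag with several children.
strings = [h.string for h in headings]
# get_text() concatenates all nested text, so it handles both headings.
texts = [h.get_text().strip() for h in headings]

print(raised)
print(strings)
print(texts)
```

If the headings on the target page may contain nested markup, get_text() is probably the safer choice of the two.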
You can do it as follows:
import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'
r = requests.get(url)
#print(r.status_code)
soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_='post-body__container')
for item in articles:
    # Join the text of every h2 inside this container into one string
    h2 = ', '.join([x.get_text() for x in item.find_all('h2')])
    print(h2)
    # print('Added article:', article)
    # articlelist.append(article)
# df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')
Output:
Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
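If the goal is still to end up with a DataFrame, as in the commented-out lines of the question, one possible approach is to store one row per heading instead of a joined string. The following is only a sketch under stated assumptions: the HTML is a shortened, made-up stand-in for the real page, html.parser replaces lxml to avoid the extra dependency, and the column name H2_Heading is taken from the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened, hypothetical stand-in for the page content (an assumption).
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <p>Some event text.</p>
  <h2>Angry Birds Dream Blast</h2>
  <h2>Angry Birds Friends</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

articlelist = []
for item in soup.find_all("div", class_="post-body__container"):
    for h2 in item.find_all("h2"):
        # One dict per heading gives one DataFrame row per heading.
        articlelist.append({"H2_Heading": h2.get_text().strip()})

df = pd.DataFrame(articlelist)

# The comma-separated form from the answer can be rebuilt from the column.
joined = ", ".join(df["H2_Heading"])

print(df)
print(joined)
```

From here, df.to_csv('articlelist.csv', index=False) would write one heading per line, which is usually easier to work with than a single cell holding a list of tags.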