Python Web Scraping - Two Different Parent Class Names, Different Structures but Same Child Class Names
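I am trying to scrape news articles from prnewswire.com. Each article is stored in a div with the class name "row".
My problem is that some article previews have an image next to the title and description. As a result, the div under each "row" has either the class name "card" (with image) or "col-sm-12 card" (without image).
My current code is as follows: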
import requests
from bs4 import BeautifulSoup
import pandas

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko)' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    prnewswire_articles = soup.find_all('div', {'class': 'col-sm-12 card'})
    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            'article_title': item.find('h3').text,
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').text,
        }
        articlelist.append(prnewswire_article)
    return


for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df.head())
print(len(df))
df.to_excel('PRNewsWire.xlsx', index=False)
print('Finished.')
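Here is what I have found: on the line where I assign prnewswire_articles and look for divs with a specific class name, I get the results I want for the class "col-sm-12 card". But "card" or "row" does not work.
I noticed that the HTML structure of the "card" divs differs from that of the "col-sm-12 card" divs, but both contain an h3 element (the article title), an a element with an href, and a p element.
This is the error message I get when I use "row" or "card" as the class name: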
Traceback (most recent call last):
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 33, in <module>
    getarticles(x)
  File "/Users/myname/PycharmProjects/projectname/prnewswire.py", line 24, in getarticles
    'article_title': item.find('h3').text,
AttributeError: 'NoneType' object has no attribute 'text'
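I have been searching all day and found nothing. I only recently started learning Python, so I apologize if this is a silly mistake, but I am looking for an answer. Any help would be much appreciated! :)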
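The traceback above is what happens when find() does not locate a tag: it returns None, and None has no .text attribute. A likely cause here, though it cannot be confirmed without the page itself, is that some divs matched by "row" or "card" contain no h3. The sketch below reproduces the situation on a small made-up HTML fragment (not the real prnewswire.com markup) and shows one way to guard against it. safe_text is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from prnewswire.com: the first row has a
# heading, the second row has no <h3> at all.
html = """
<div class="row"><div class="card"><h3>Title A</h3><p>Text A</p></div></div>
<div class="row"><div class="card"><p>Text B</p></div></div>
"""

soup = BeautifulSoup(html, "html.parser")


def safe_text(parent, tag_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(tag_name)  # find() returns None when nothing matches
    return tag.get_text(strip=True) if tag is not None else None


# Calling row.find("h3").text directly would raise AttributeError on the
# second row; the helper returns None for it instead.
titles = [safe_text(row, "h3") for row in soup.find_all("div", class_="row")]
print(titles)
```

With a guard like this, a row without a heading shows up as a missing value instead of stopping the whole run.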
You can select all .row elements under the class .card-list (using a CSS selector). I also changed how the article title is extracted (just get the text that follows the <small> element):
import pandas
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko)' +
        'Version/14.0.1 Safari/605.1.15'
}

articlelist = []


def getarticles(page):
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # select all rows that are under class "card-list"
    prnewswire_articles = soup.select('.card-list > .row')
    for item in prnewswire_articles:
        prnewswire_article = {
            'page': page,
            # select the text that comes after the <small> element
            'article_title': item.select_one('h3 small').find_next_sibling(text=True).strip(),
            'article_link': 'https://www.prnewswire.com/' +
                            item.find('a')['href'],
            'article_description': item.find('p').get_text(strip=True, separator='\n'),
        }
        articlelist.append(prnewswire_article)


for x in range(1, 3):
    getarticles(x)

df = pandas.DataFrame(articlelist)
print(df)
Prints:
...
186 2 Wugen Announces Exclusive Partnership Agreemen... https://www.prnewswire.com//news-releases/wuge... Wugen Inc., a clinical-stage biotechnology com...
187 2 Upstryve Initiates Mentor Network for Trade St... https://www.prnewswire.com//news-releases/upst... Upstryve Inc (Upstryve) www.upstryve.com. Upst...
188 2 Inkling Simplifies Integration to Learning and... https://www.prnewswire.com//news-releases/inkl... Inkling, a global leader in digital learning p...
189 2 CommerceHub to Participate in Bank of America'... https://www.prnewswire.com//news-releases/comm... CommerceHub, a leading provider of ecommerce s...
190 2 Instrument Promotes Kara Place to President https://www.prnewswire.com//news-releases/inst... Instrument, a digitally focused, creative agen...
191 2 Regent Properties Announces Executive Promotions https://www.prnewswire.com//news-releases/rege... Regent Properties ("Regent"), a real estate in...
192 2 PowerPay Hits Billion in Home Renovations L... https://www.prnewswire.com//news-releases/powe... PowerPay, the nation's fastest-growing home im...
193 2 GoldenTree Announces Closing of 8 Million C... https://www.prnewswire.com//news-releases/gold... GoldenTree Loan Management II ("GLM II") and i...
194 2 The Real-Time Moving Show on the Screen "Showt... https://www.prnewswire.com//news-releases/the-... EnableWow (www.showtap.com) launched a new pre...
195 2 Black Knight: Lock Activity Suggests Q1 2021 R... https://www.prnewswire.com//news-releases/blac... Today, the Data & Analytics division of Black ...
196 2 LG Innotek Joins Hands with Microsoft to Proli... https://www.prnewswire.com//news-releases/lg-i... LG Innotek (CEO Cheoldong Jeong) announced on ...
197 2 MicroWorkers Integrates Ontology's ONTO Wallet... https://www.prnewswire.com//news-releases/micr... To bridge micro workers globally, Ontology and...
198 2 MemVerge Introduces M3 Channel Partner Program https://www.prnewswire.com//news-releases/memv... MemVerge™, the pioneers of Big Memory software...
199 2 Innovative Deals Spur the Growth of New Sports... https://www.prnewswire.com//news-releases/inno... Last year has shaped up to be crucial for the ...
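To see why the .card-list > .row selector covers both layouts, here is a minimal offline sketch. The HTML is a made-up imitation of the two structures described in the question (a "card" with an image and a "col-sm-12 card" without one), so the exact tags are an assumption, not the real page. It also uses urljoin from the standard library, which is one way to avoid the double slash visible in the links printed above, and it skips rows that lack any of the expected elements.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.prnewswire.com/"

# Hypothetical markup imitating the two layouts; not the site's actual HTML.
html = """
<div class="card-list">
  <div class="row">
    <div class="card">
      <img src="thumb.jpg">
      <a href="/news-releases/with-image.html">
        <h3><small>10:00 ET</small> Title With Image</h3>
      </a>
      <p>Description one.</p>
    </div>
  </div>
  <div class="row">
    <div class="col-sm-12 card">
      <a href="/news-releases/no-image.html">
        <h3><small>10:05 ET</small> Title Without Image</h3>
      </a>
      <p>Description two.</p>
    </div>
  </div>
</div>
<div class="row"><p>Unrelated layout row with no heading.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
articles = []

# Only rows that are direct children of .card-list are visited, so the
# unrelated row at the bottom is never touched.
for row in soup.select(".card-list > .row"):
    small = row.select_one("h3 small")
    link = row.find("a", href=True)
    desc = row.find("p")
    if small is None or link is None or desc is None:
        continue  # skip rows that do not look like an article preview

    # The title is the text node that follows <small> inside the <h3>.
    title = small.find_next_sibling(string=True)
    articles.append({
        "article_title": title.strip() if title is not None else None,
        # urljoin resolves the root-relative href without doubling the slash
        "article_link": urljoin(BASE_URL, link["href"]),
        "article_description": desc.get_text(strip=True),
    })

for article in articles:
    print(article)
```

The selector keys on the parent container rather than on the card's own class, which is why the difference between "card" and "col-sm-12 card" stops mattering.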