Google News title tag for Beautiful Soup
I'm trying to pull search results from Google News (e.g. for "vaccine") and run some sentiment analysis on the collected headlines.
So far, I can't seem to find the right tag to collect the headlines.
Here is my code:
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        for h in headline_results:
            blob = TextBlob(h.get_text())
            self.sentiment += blob.sentiment.polarity / len(headline_results)
            self.subjectivity += blob.sentiment.subjectivity / len(headline_results)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ', a.sentiment)
The results for sentiment and subjectivity are always 0. I feel like the problem is with class_="phYMDf nDgy9d".
If you browse to the link, you will see the finished state of the page, but requests.get does not execute or load any data other than the page you requested. Fortunately there is some data, and you can scrape it. I suggest you use an HTML beautifier service such as codebeautify to get a better understanding of the page structure.
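If you prefer to stay in Python, here is a minimal sketch of the same idea: dump the prettified markup to a file and inspect the structure in your editor (the query URL is just the one from the question):

import requests
from bs4 import BeautifulSoup

# Fetch the same page the question requests and write an indented copy of
# the markup to disk, one tag per line, for manual inspection.
response = requests.get("https://www.google.com/search?q=Vaccine&source=lnms&tbm=nws")
soup = BeautifulSoup(response.text, "html.parser")
with open("page.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())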
Also, if you see classes like phYMDf nDgy9d, be sure to avoid searching with them. They are minified versions of classes, so at any point, if Google changes part of the CSS code, the class you are looking for will get a new name.
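To illustrate, a small hypothetical sketch: anchor on things Google is unlikely to rename, such as the id main or the /url href prefix, rather than on minified class names (the HTML string below is made up for demonstration):

import re
from bs4 import BeautifulSoup

# A toy document standing in for a Google results page.
html = '<div id="main"><a href="/url?q=https://example.com"><div>Example headline</div></a></div>'
soup = BeautifulSoup(html, "html.parser")

# An id or an href pattern survives a CSS redeploy; a class like "phYMDf nDgy9d" does not.
links = soup.find_all("a", href=re.compile(r"^/url"))
print([a.get_text() for a in links])  # ['Example headline']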
What I did is probably overkill, but I managed to dig down to the specific parts, and your code works now.
When you look at the prettier version of the requested HTML file, the necessary contents are in the div with the id main. Its children start with a div element containing the Google Search bar, continue with a style element, and after one empty div element come the post div elements. The last two elements in that children list are footer and script elements. We can cut these off with [3:-2], and under that tree we have (almost) pure data. If you check the rest of the code after the posts variable, I think you can understand it.
Here is the code:
from textblob import TextBlob
import re
import requests
from bs4 import BeautifulSoup
from pprint import pprint

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        #print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        mainDiv = soup.find("div", {"id": "main"})
        posts = [i for i in mainDiv.children][3:-2]
        news = []
        for post in posts:
            reg = re.compile(r"^/url.*")
            cursor = post.findAll("a", {"href": reg})
            postData = {}
            postData["headline"] = cursor[0].find("div").get_text()
            postData["source"] = cursor[0].findAll("div")[1].get_text()
            postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
            postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
            news.append(postData)
        pprint(news)
        for h in news:
            blob = TextBlob(h["headline"] + " " + h["description"])
            self.sentiment += blob.sentiment.polarity / len(news)
            self.subjectivity += blob.sentiment.subjectivity / len(news)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ', a.sentiment)
A couple of outputs:
[{'description': 'It comes after US health officials said last week they had '
                 'started a trial to evaluate a possible vaccine in Seattle. '
                 'The Chinese effort began on...',
  'headline': 'China embarks on clinical trial for virus vaccine',
  'source': 'The Star Online',
  'timeAgo': '5 saat önce'},
 {'description': 'Hanneke Schuitemaker, who is leading a team working on a '
                 'Covid-19 vaccine, tells of the latest developments and what '
                 'needs to be done now.',
  'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
              'coronavirus’',
  'source': 'The Guardian',
  'timeAgo': '20 saat önce'},
 ...]
Vaccine Subjectivity: 0.34522727272727277 Sentiment: 0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
                 'Wi-Fi and cell phone signal boosters, to noise-cancelling '
                 'headphones and gadgets...',
  'headline': '10 Cool Tech Gadgets To Survive Working From Home',
  'source': 'CRN',
  'timeAgo': '2 gün önce'},
 {'description': 'Over the past few years, smart home products have dominated '
                 'the gadget space, with goods ranging from innovative updates '
                 'to the items we...',
  'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
  'source': 'Entrepreneur',
  'timeAgo': '2 hafta önce'},
 ...]
Home Gadgets Subjectivity: 0.48007305194805205 Sentiment: 0.3114683441558441
I used the headline and description data for the operation, but you can play with it if you want. You have the data now :)
Use this:

headline_results = soup.find_all('div', {'class' : 'BNeawe vvjwJb AP7Wnd'})

You already printed response.text; if you want to find specific data, search for it within the response.text output.
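A quick sanity check along those lines (a sketch, reusing the question's URL): confirm the class you are targeting actually occurs in the HTML that requests received before blaming the parser:

import requests

response = requests.get("https://www.google.com/search?q=Vaccine&source=lnms&tbm=nws")
# Minified class names often differ between the browser view and the plain requests view.
for needle in ("phYMDf nDgy9d", "BNeawe vvjwJb AP7Wnd"):
    print(needle, "->", needle in response.text)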
Try using select() instead. CSS selectors are more flexible. See the CSS selectors reference.

Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
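As a sketch of that flexibility, one select() call can combine an id, a class, and an attribute prefix match, which would take several chained find_all() calls otherwise (the toy HTML and the class names below are assumptions mimicking this page's structure):

from bs4 import BeautifulSoup

# A toy fragment mimicking the structure of a news result container.
html = '<div id="main"><div class="dbsr"><a href="/url?q=x"><div class="nDgy9d">Title</div></a></div></div>'
soup = BeautifulSoup(html, "html.parser")

# Descendant combinators plus an attribute prefix selector in a single expression.
print(soup.select('#main .dbsr a[href^="/url"] .nDgy9d')[0].get_text())  # Title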
If you want to grab all the titles and the rest, you are looking for this container:
soup.select('.dbsr')
Make sure you're passing a user-agent, because Google might eventually block your request and you'll receive different HTML, thus an empty output. Check what your user-agent is.

Pass the user-agent:
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
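One way to see the User-Agent your script actually sends is to hit a header-echo service (httpbin is used here as an assumption; any echo endpoint works):

import requests

# httpbin.org/headers returns the request headers back as JSON.
print(requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"])
# e.g. 'python-requests/2.31.0' -- the default that Google tends to block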
I'm not sure exactly what you're trying to do, but the other answer's solution is a bit of an overkill, as mentioned: the slicing, the regex, doing something in div#main. It is much simpler.
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = f"https://www.google.com/search?q={self.term}&tbm=nws"

    def run(self):
        response = requests.get(self.url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")

        news_data = []
        for result in soup.select('.dbsr'):
            title = result.select_one('.nDgy9d').text
            link = result.a['href']
            source = result.select_one('.WF4CUc').text
            snippet = result.select_one('.Y3v8qd').text
            date_published = result.select_one('.WG9SHc span').text

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)

a = Analysis("Lasagna")
a.run()
print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: ", a.sentiment)

# Vaccine Subjectivity: 0.3255952380952381 Sentiment: 0.05113636363636363
# Lasagna Subjectivity: 0.36556818181818185 Sentiment: 0.25386093073593075
Alternatively, you can achieve the same thing by using the Google News Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to maintain the parser, figure out how to parse certain elements or why something isn't working as it should, plus understand how to bypass blocks from Google. All that needs to be done is to iterate over structured JSON and quickly get what you want.

Code to integrate with your example:
from textblob import TextBlob
import os
from serpapi import GoogleSearch

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0

    def run(self):
        params = {
            "engine": "google",
            "tbm": "nws",
            "q": self.term,  # search for the term itself, not a URL
            "api_key": os.getenv("API_KEY"),
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        news_data = []
        for result in results['news_results']:
            title = result['title']
            link = result['link']
            snippet = result['snippet']
            source = result['source']
            date_published = result['date']

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)

a = Analysis("Vaccine")
a.run()
print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: ", a.sentiment)

# Vaccine Subjectivity: 0.30957251082251086 Sentiment: 0.06277056277056277
# Lasagna Subjectivity: 0.30957251082251086 Sentiment: 0.06277056277056277
P.S. - I wrote a more detailed blog post about how to scrape Google News.

Disclaimer: I work for SerpApi.