Tag of Google News title for BeautifulSoup

I am trying to pull search results from Google News (for example, "vaccine") and run some sentiment analysis on the headlines I collect.

So far, I can't seem to find the correct tag to collect the headlines.

Here is my code:

from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        for h in headline_results:
            blob = TextBlob(h.get_text())
            self.sentiment += blob.sentiment.polarity / len(headline_results)
            self.subjectivity += blob.sentiment.subjectivity / len(headline_results)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)

The results for sentiment and subjectivity are always 0. I feel like the problem is with class_="phYMDf nDgy9d".
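A quick sanity check (a minimal sketch, dropped into run() right after the find_all call) confirms it: the selector matches nothing, so the loop body never executes and both totals stay at 0.

        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        print(len(headline_results))  # prints 0: no element in the raw HTML carries this class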

If you browse to the link, you will see the finished state of the page, but requests.get does not execute or load any data beyond the page you requested. Luckily there is some data there, and you can scrape it. I suggest you run the page through an HTML prettifier service such as codebeautify to get a better understanding of its structure.

Also, if you see classes like phYMDf nDgy9d, be sure to avoid finding elements with them. They are minified class names, so at any moment, if Google changes part of the CSS code, the class you are looking for will get a new name.

What I did is probably overkill, but I managed to dig down to grab the specific parts, and your code works now.

When you look at the prettified version of the requested HTML file, the necessary content is in a div with the id main. Its children start with a div element (the Google search bar), continue with a style element, and after one empty div element come the post div elements. The last two children are footer and script elements. We can cut these off with [3:-2], and under that subtree we have (almost) pure data. If you check the rest of the code after the posts variable, I think you can understand it.

Here is the code:

from textblob import TextBlob
import requests, re
from bs4 import BeautifulSoup
from pprint import pprint

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        #print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        mainDiv = soup.find("div", {"id": "main"})
        # drop the leading search bar/style/empty div and the trailing footer/script elements
        posts = list(mainDiv.children)[3:-2]
        news = []
        for post in posts:
            # each post contains two anchors whose href starts with "/url"
            reg = re.compile(r"^/url.*")
            cursor = post.findAll("a", {"href": reg})
            postData = {}
            postData["headline"] = cursor[0].find("div").get_text()
            postData["source"] = cursor[0].findAll("div")[1].get_text()
            postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
            postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
            news.append(postData)
        pprint(news)
        for h in news:
            blob = TextBlob(h["headline"] + " "+ h["description"])
            self.sentiment += blob.sentiment.polarity / len(news)
            self.subjectivity += blob.sentiment.subjectivity / len(news)

a = Analysis('Vaccine')
a.run()

print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)

A couple of outputs:

[{'description': 'It comes after US health officials said last week they had '
                 'started a trial to evaluate a possible vaccine in Seattle. '
                 'The Chinese effort began on...',
  'headline': 'China embarks on clinical trial for virus vaccine',
  'source': 'The Star Online',
  'timeAgo': '5 saat önce'},
 {'description': 'Hanneke Schuitemaker, who is leading a team working on a '
                 'Covid-19 vaccine, tells of the latest developments and what '
                 'needs to be done now.',
  'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
              'coronavirus’',
  'source': 'The Guardian',
  'timeAgo': '20 saat önce'},
 .
 .
 .
Vaccine Subjectivity:  0.34522727272727277 Sentiment:  0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
                 'Wi-Fi and cell phone signal boosters, to noise-cancelling '
                 'headphones and gadgets...',
  'headline': '10 Cool Tech Gadgets To Survive Working From Home',
  'source': 'CRN',
  'timeAgo': '2 gün önce'},
 {'description': 'Over the past few years, smart home products have dominated '
                 'the gadget space, with goods ranging from innovative updates '
                 'to the items we...',
  'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
  'source': 'Entrepreneur',
  'timeAgo': '2 hafta önce'},
 .
 .
 .
Home Gadgets Subjectivity:  0.48007305194805205 Sentiment:  0.3114683441558441

I used the headline and description data for the analysis, but you can experiment with the other fields if you like. You have the data now :)
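For example, to score only the headlines (a small sketch reusing the news list built in run() above):

for post in news:  # 'news' as built inside run()
    blob = TextBlob(post["headline"])
    print(post["headline"], "->", blob.sentiment.polarity)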

Use this:

headline_results = soup.find_all('div', {'class' : 'BNeawe vvjwJb AP7Wnd'})

You have already printed response.text; if you want to find specific data, search for it within the response.text output.
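For instance (a minimal sketch; the class string comes from this answer, and as noted above such minified names can change at any time):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.google.com/search?q=Vaccine&source=lnms&tbm=nws')
# first confirm the class actually appears in the raw, non-JavaScript HTML
print('BNeawe vvjwJb AP7Wnd' in response.text)

soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all('div', {'class': 'BNeawe vvjwJb AP7Wnd'}):
    print(div.get_text())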

Try using select() instead; CSS selectors are more flexible. See the CSS selectors reference.
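One reason select() is more flexible: find_all() with class_="phYMDf nDgy9d" matches the class attribute as one exact string, while a CSS selector matches any element that carries both classes, in any order (a small sketch, assuming soup has already been built):

# exact-string match on the whole class attribute: brittle and order-sensitive
soup.find_all('div', class_="phYMDf nDgy9d")

# CSS selector: matches elements having both classes, regardless of order or extras
soup.select('div.phYMDf.nDgy9d')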

Check out the SelectorGadget Chrome extension, which lets you grab CSS selectors by clicking on the desired element in your browser.

If you want to grab all the titles and so on, then you are looking for this container:

soup.select('.dbsr')

Make sure you pass a user-agent, because Google may eventually block your request; you would then receive different HTML, and thus an empty output. Check what your user-agent is.

Pass the user-agent:

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)

I'm not sure what exactly you're trying to do, but the solution above is a bit of an overkill, as its author mentioned: the slicing, the regex, doing something within div#main. It is much simpler than that.


Code and example in the online IDE:

from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

headers = {
   "User-agent":
   "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = f"https://www.google.com/search?q={self.term}&tbm=nws"
    
 
    def run (self):
        response = requests.get(self.url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")

        news_data = []

        for result in soup.select('.dbsr'):
            title = result.select_one('.nDgy9d').text
            link = result.a['href']
            source = result.select_one('.WF4CUc').text
            snippet = result.select_one('.Y3v8qd').text
            date_published = result.select_one('.WG9SHc span').text

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)


a = Analysis("Lasagna")
a.run()

print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: " , a.sentiment)

# Vaccine Subjectivity:  0.3255952380952381 Sentiment:  0.05113636363636363
# Lasagna Subjectivity:  0.36556818181818185 Sentiment:  0.25386093073593075

Alternatively, you can achieve the same thing by using the Google News Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to maintain the parser, figure out how to parse certain elements, or work out why something doesn't run as it should; nor do you need to understand how to bypass blocks from Google. All that needs to be done is to iterate over structured JSON and quickly get what you want.

Code integrated with your example:


from textblob import TextBlob
import os
from serpapi import GoogleSearch


class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = f"https://www.google.com/search"
    
 
    def run (self):
        params = {
          "engine": "google",
          "tbm": "nws",
          "q": self.url,
          "api_key": os.getenv("API_KEY"),
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        news_data = []

        for result in results['news_results']:
            title = result['title']
            link = result['link']
            snippet = result['snippet']
            source = result['source']
            date_published = result['date']

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)


a = Analysis("Vaccine")
a.run()

print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: " , a.sentiment)


# Vaccine Subjectivity:  0.30957251082251086 Sentiment:  0.06277056277056277

P.S. - I wrote a slightly more in-depth blog post about how to scrape Google News.

Disclaimer, I work for SerpApi.