I need to web-scrape Google News to get the links to different articles from different newspapers

I need to web-scrape Google News to get the links to different articles from different newspapers. I have code that works well for today's news (from Google News), but it does not work for older articles. For example, this code gets the different article links from Google News:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
import time
from newspaper import Article
import random
import pandas as pd

root = 'https://www.google.com/'
time.sleep(random.randint(0, 3)) #----------stop---------#

link = 'https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&tbm=nws&ei=qEWUYorfOuiy5OUP-aGLgA4&ved=0ahUKEwiK07Wfr4b4AhVoGbkGHfnQAuAQ4dUDCA0&uact=5&oq=revuelta+la+tercera&gs_lcp=Cgxnd3Mtd2l6LW5ld3MQAzIFCCEQoAEyBQghEKABOgsIABCABBCxAxCDAToFCAAQgAQ6CAgAEIAEELEDOggIABCxAxCDAToKCAAQsQMQgwEQQzoECAAQQzoECAAQCjoGCAAQHhAWOggIABAeEA8QFlDIEliUnwFg1aABaAVwAHgAgAGSAYgBuw-SAQQyMS4ymAEAoAEBsAEAwAEB&sclient=gws-wiz-news'
time.sleep(random.randint(0, 6)) #----------stop---------#

req = Request(link, headers = {'User-Agent': 'Mozilla/5.0'})
time.sleep(random.randint(0, 3)) #----------stop---------#

requests.get(link, headers = {'User-agent': 'your bot 0.1'})
time.sleep(random.randint(0, 6)) #----------stop---------#

webpage = urlopen(req).read()
time.sleep(random.randint(0, 6)) #----------stop---------#

with requests.Session() as c:
    soup = BeautifulSoup(webpage, 'html5lib')
    for item in soup.find_all('div', attrs={'class': 'ZINbbc luh4tb xpd O9g5cc uUPGi'}):
        raw_link = item.find('a', href=True)['href']
        link = raw_link.split('/url?q=')[1].split('&sa=U&')[0]

        article = Article(link, language="es")
        article.download()
        article.parse()

        title = article.title
        descript = article.text
        date = article.publish_date

        print(title)
        print(descript)
        print(link)

Now I need to change the dates for the same search, so I simply change the link to use a custom date interval:

root = 'https://www.google.com/'
time.sleep(random.randint(0, 3)) #----------stop---------#

link = 'https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018&tbm=nws'
time.sleep(random.randint(0, 6)) #----------stop---------#

req = Request(link, headers = {'User-Agent': 'Mozilla/5.0'})
time.sleep(random.randint(0, 3)) #----------stop---------#

requests.get(link, headers = {'User-agent': 'your bot 0.1'})
time.sleep(random.randint(0, 6)) #----------stop---------#

webpage = urlopen(req).read()
time.sleep(random.randint(0, 6)) #----------stop---------#

with requests.Session() as c:
    soup = BeautifulSoup(webpage, 'html5lib')
    for item in soup.find_all('div', attrs={'class': 'ZINbbc luh4tb xpd O9g5cc uUPGi'}):
        raw_link = item.find('a', href=True)['href']
        link = raw_link.split('/url?q=')[1].split('&sa=U&')[0]

        article = Article(link, language="es")
        article.download()
        article.parse()

        title = article.title
        descript = article.text
        date = article.publish_date

        print(title)
        print(descript)
        print(link)

The links should be different (because of the change in search dates), but both searches give me the same results, and I don't understand why. Can anyone help? I don't know how to fix this.

The URL you provided is

https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018&tbm=nws

If you look closely, cd_min and cd_max appear to contain date data.

cd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018

So let's split them out of the URL. The string above is taken from the URL and is URL-encoded. If you decode it, you will see:

cd_min:1/1/2018,cd_max:1/6/2018
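You can verify that decoding yourself with Python's standard `urllib.parse` module (a quick sanity check, separate from the scraper):

```python
from urllib import parse

# The URL-encoded fragment taken from the tbs query parameter
encoded = "cd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F6%2F2018"

# unquote reverses the percent-encoding: %3A -> ':', %2F -> '/', %2C -> ','
decoded = parse.unquote(encoded)
print(decoded)  # cd_min:1/1/2018,cd_max:1/6/2018
```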

So if you want to change the dates of the query, you should change these values in the URL.

from urllib import parse

# URL-encode the date fragments (don't forget the ':' and ',')
start_date = parse.quote_plus(":1/1/2018,")
end_date = parse.quote_plus(":1/6/2018")

# CREATE QUERY
link = f"https://www.google.com/search?q=revuelta+la+tercera&rlz=1C1UEAD_esCL995CL995&biw=1536&bih=714&source=lnt&tbs=cdr%3A1%2Ccd_min{start_date}cd_max{end_date}&tbm=nws"

Writing the code to randomize the dates is up to you :)
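If it helps as a starting point, here is a sketch of how such a helper could look; `date_range_link` is a made-up name, not part of any library, and it assumes Google reads the dates in the tbs parameter as M/D/YYYY:

```python
from datetime import date
from urllib import parse

def date_range_link(query: str, start: date, end: date) -> str:
    # Hypothetical helper: builds a Google News search URL restricted
    # to the interval [start, end]. Assumes Google interprets the dates
    # inside the tbs parameter as M/D/YYYY.
    cd_min = parse.quote_plus(f":{start.month}/{start.day}/{start.year},")
    cd_max = parse.quote_plus(f":{end.month}/{end.day}/{end.year}")
    return (
        "https://www.google.com/search?q=" + parse.quote_plus(query)
        + "&tbm=nws&tbs=cdr%3A1%2Ccd_min" + cd_min + "cd_max" + cd_max
    )

link = date_range_link("revuelta la tercera", date(2018, 1, 1), date(2018, 1, 6))
print(link)
```

From there, randomizing is just a matter of picking random `date` objects and passing them in.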