How to get more than 20 news headline links for a subsection (e.g. Middle East) of Reuters website using Python?

I am trying to scrape all the news headlines related to the Middle East from the Reuters website. Link to the page: https://www.reuters.com/subjects/middle-east

When I scroll down, the page automatically loads the earlier headlines, but when I look at the page source it only contains the last 20 headline links.

I tried looking for a "next" or "previous" hyperlink, which is the usual way to solve this kind of problem, but unfortunately there is no such hyperlink on this page.
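For illustration, the kind of check I ran was roughly this (the link texts I search for are just guesses on my part; the result comes back empty for this page):

# Illustrative sketch only: look for the usual "next"/"previous" pagination links
import requests
from bs4 import BeautifulSoup

url = 'https://www.reuters.com/subjects/middle-east'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Anchors whose visible text looks like a pagination control
nav_links = [a for a in soup.find_all('a')
             if a.get_text(strip=True).lower() in ('next', 'previous', 'more')]
print(nav_links)  # empty for this page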

import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.reuters.com/subjects/middle-east'

result = requests.get(url)
content = result.content
soup = BeautifulSoup(content, 'html.parser')  

# Gets all the links on the page source
links = []
for hl in soup.find_all('a'):
    href = hl.get('href', '')  # some <a> tags have no href attribute
    if re.search('article', href):
        links.append(href)

# The first link is the page itself and so we skip it
links = links[1:]

# The urls are repeated and so we only keep the unique instances
urls = []
for url in links:
    if url not in urls:
        urls.append(url)

# The number of urls is limited to 20 (THE PROBLEM!)
print(len(urls))

My experience with all of this is very limited, but my best guess is that JavaScript, or whatever language the page uses, is what loads the earlier results as you scroll down, and that this is what I need to find a way to handle with some Python module.

The code goes on to extract further details from each link, but that is not relevant to the question posted.

You can use selenium's Keys.PAGE_DOWN option to scroll down first and then grab the page source. You can feed that to BeautifulSoup if you wish.

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.reuters.com/subjects/middle-east")
time.sleep(1)

elem = browser.find_element_by_tag_name("body")
no_of_pagedowns = 25
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns-=1

source=browser.page_source
soup = BeautifulSoup(source, 'html.parser')

# Gets all the links on the page source
links = []
for hl in soup.find_all('a'):
    href = hl.get('href', '')  # some <a> tags have no href attribute
    if re.search('article', href):
        links.append(href)

# The first link is the page itself and so we skip it
links = links[1:]

# The urls are repeated and so we only keep the unique instances
urls = []
for url in links:
    if url not in urls:
        urls.append(url)

# The number of urls is limited to 20 (THE PROBLEM!)
print(len(urls))

Output

40
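If you want everything the page will load, rather than a fixed number of page-downs, a possible variant (a sketch only, untested against the live page) is to keep scrolling to the bottom with execute_script and stop once the count of article links stops growing:

import time
import re
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.reuters.com/subjects/middle-east")
time.sleep(1)

def article_link_count(page_source):
    # Count distinct hrefs containing 'article' in the current page source
    soup = BeautifulSoup(page_source, 'html.parser')
    hrefs = {a['href'] for a in soup.find_all('a')
             if a.get('href') and re.search('article', a['href'])}
    return len(hrefs)

previous, current = -1, article_link_count(browser.page_source)
while current > previous:
    previous = current
    # Scroll to the bottom and give the page a moment to load more headlines
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    current = article_link_count(browser.page_source)

print(current)
browser.quit()

Since an infinite feed can keep loading for a long time, you may also want to add an upper bound on the number of scroll iterations.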