Selenium in Python: Run scraping code after all lazy-loaded components are loaded

Question

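I'm new to Selenium, and after searching for a solution I still have the following problem.

I am trying to access all the links on this website (https://www.ecb.europa.eu/press/pressconf/html/index.en.html).

The individual links are lazy-loaded: they load gradually as the user scrolls down the page.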

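import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 3 style; recent Selenium 4 releases take the driver path via a Service object instead.
driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# Scroll to the bottom repeatedly until the page height stops growing.
lastHeight = driver.execute_script("return document.body.scrollHeight")

pause = 0.5
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
    print(lastHeight)

# Print every link whose href looks like a statement page (e.g. is<digits>.en.html).
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    url = elem.get_attribute("href")
    if re.search(r'is\d+\.en\.html', url):
        print(url)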

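However, I only get the links that belong to the last lazy-loaded element. The earlier ones are missing because they were never loaded.

I want to make sure that all lazy-loaded elements have been loaded before any scraping code runs. How can I do that?

Many thanks.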

Answer 1

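Selenium is not designed for web scraping (although it can be useful in complex cases). In your case, open the developer tools (F12 -> Network) and watch the XHR tab while scrolling down the page. You can see that the requests being added contain the year in their URL. So as you scroll down to other years, the page issues one sub-request per year.

Look at the Response tab to find the divs and classes, and build the BeautifulSoup 'find_all' calls from them. A simple loop over the years with requests and bs is enough: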

import requests as rq
from bs4 import BeautifulSoup as bs


headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}

resultats = []

# One request per year: each year's entries are served from their own include file.
for year in range(1998, 2021 + 1):

    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    resp = rq.get(url, headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = bs(resp.content, "html.parser")

    # The attribute filter is a dict ({"class": ...}), not a set.
    titles = [x.text for x in soup.find_all("div", {"class": "title"})]
    subtitles = [x.text for x in soup.find_all("div", {"class": "subtitle"})]
    dates = [x.text for x in soup.find_all("dt")]

    # Pair up the date, title and subtitle of each entry.
    resultats.extend(zip(dates, titles, subtitles))

The result contains:

...
('8 November 2012',
  'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
  'Mario Draghi,  President of the ECB,  Vítor Constâncio,  Vice-President of the ECB,  Frankfurt am Main,  8 November 2012'),
 ('4 October 2012',
  'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
  'Mario Draghi,  President of the ECB,  Vítor Constâncio,  Vice-President of the ECB,  Brdo pri Kranju,  4 October 2012'),
...
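
The question asked for the links themselves, while the loop above collects dates and titles. As a minimal sketch, assuming each yearly fragment contains ordinary <a href> tags, the links could be filtered with the pattern from the question and their text cleaned of the non-breaking spaces visible in the output above. The sample markup and the names clean_text, LinkCollector and extract_statement_links are illustrative assumptions, and the standard library's html.parser stands in for BeautifulSoup here so that the snippet runs without extra packages.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Pattern from the question, with the dots escaped.
LINK_PATTERN = re.compile(r"is\d+\.en\.html")


def clean_text(text):
    """Turn non-breaking spaces into plain ones and collapse repeated whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


class LinkCollector(HTMLParser):
    """Collects (href, link text) for every <a href=...> element it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a> tag carrying an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, clean_text("".join(self._text))))
            self._href = None


def extract_statement_links(html, base_url):
    """Return (absolute URL, cleaned text) for links matching LINK_PATTERN."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        (urljoin(base_url, href), text)
        for href, text in parser.links
        if LINK_PATTERN.search(href)
    ]


# Hypothetical fragment, shaped like what a yearly include file might contain.
sample = """
<dt>8 November 2012</dt>
<dd><div class="title"><a href="/press/pressconf/2012/html/is121108.en.html">Mario Draghi, Vítor Constâncio:&nbsp;Introductory statement to the press conference (with Q&amp;A)</a></div></dd>
<dd><a href="/press/pr/date/2012/html/pr121108.en.html">Unrelated link</a></dd>
"""

links = extract_statement_links(
    sample,
    "https://www.ecb.europa.eu/press/pressconf/2012/html/index_include.en.html",
)
print(links)
```

In the yearly loop, the same function could be called on resp.text with the request URL as base_url, extending a list of links instead of (or alongside) resultats.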
