Selenium in Python: run scraping code after all lazy-loaded components have loaded
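I am new to Selenium, and after searching for a solution I still have the following problem.
I am trying to collect all the links on this page (https://www.ecb.europa.eu/press/pressconf/html/index.en.html).
The individual links are lazy-loaded: they are added gradually as the user scrolls down the page.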
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 4.6+ locates the driver itself; older versions need the chromedriver path.
driver = webdriver.Chrome()
driver.get("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# Scroll to the bottom repeatedly until the page height stops changing.
lastHeight = driver.execute_script("return document.body.scrollHeight")
pause = 0.5
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
    print(lastHeight)

# Print every link whose URL matches the is<digits>.en.html pattern.
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    url = elem.get_attribute("href")
    if re.search(r"is\d+\.en\.html", url):
        print(url)
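However, only the links from the last lazy-loaded section are retrieved; the earlier ones are missing because they were never loaded.
I want to make sure all lazy-loaded elements have loaded before any scraping code runs. How can I do that?
Many thanks.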
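Selenium is not designed for web scraping (although it can be useful in complex cases). In your case, press F12, open the Network tab and watch the XHR filter while scrolling down the page. You will see that each request added has the year in its URL, so the page issues a sub-request whenever you scroll down to another year.
Look at the Response tab to find the div elements and their classes, then build the BeautifulSoup find_all calls from them.
A short loop over the years with requests and bs4 is then enough: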
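If you would rather stay with Selenium, one option is to scroll one viewport at a time instead of jumping straight to the bottom, so that every year's section passes through view and can trigger its request. The following is only a minimal sketch under that assumption; the helper name `scroll_through_page` and its parameters are made up for illustration, and suitable values for `pause` and `stable_rounds` would need tuning against the real page.

```python
import time


def scroll_through_page(driver, pause=0.5, stable_rounds=3, max_steps=1000):
    """Scroll down one viewport at a time until the page stops growing.

    `driver` is any object with an `execute_script(js)` method, such as a
    Selenium WebDriver. Returns the final document height in pixels.
    """
    stable = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_steps):
        # Move down by one viewport so no section is skipped.
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        time.sleep(pause)  # give pending requests time to finish
        height = driver.execute_script("return document.body.scrollHeight")
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight - 1;"
        )
        if at_bottom and height == last_height:
            # At the bottom with an unchanged height: count it as stable.
            stable += 1
            if stable >= stable_rounds:
                break
        else:
            stable = 0
        last_height = height
    return last_height
```

With a real browser you would call `scroll_through_page(driver)` after `driver.get(...)` and only then collect the links. Requiring several unchanged checks in a row, rather than one, is meant to reduce the chance of stopping while a slow request is still in flight.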
import requests as rq
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}
resultats = []
for year in range(1998, 2021 + 1):
    # Each year's entries are served as a separate HTML fragment.
    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    resp = rq.get(url, headers=headers)
    soup = bs(resp.content, "html.parser")  # name the parser explicitly
    # The attribute filter must be a dict ({"class": ...}), not a set.
    titles = map(lambda x: x.text, soup.find_all("div", {"class": "title"}))
    subtitles = map(lambda x: x.text, soup.find_all("div", {"class": "subtitle"}))
    dates = map(lambda x: x.text, soup.find_all("dt"))
    # Combine the three columns into (date, title, subtitle) tuples.
    zipped = list(zip(dates, titles, subtitles))
    resultats.extend(zipped)
The result contains:
...
('8 November 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi, President of the ECB, Vítor Constâncio, Vice-President of the ECB, Frankfurt am Main, 8 November 2012'),
('4 October 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi, President of the ECB, Vítor Constâncio, Vice-President of the ECB, Brdo pri Kranju, 4 October 2012'),
...
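The loop above collects dates, titles and subtitles, whereas the question asked for the links. The same per-year fragments could be filtered for hrefs using the pattern from the question. The sketch below uses only the standard library so it runs without extra dependencies; `LinkCollector`, `statement_links` and the sample fragment are hypothetical and only illustrate the idea, since the real markup may differ.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Same pattern as in the question, with the dots escaped.
STATEMENT_RE = re.compile(r"is\d+\.en\.html")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def statement_links(html, base_url):
    """Return absolute URLs of the links in `html` that match STATEMENT_RE."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs if STATEMENT_RE.search(h)]


# Hypothetical fragment, shaped like what a per-year response might contain.
sample = """
<dt>8 November 2012</dt>
<dd><div class="title"><a href="/press/pressconf/2012/html/is121108.en.html">Statement</a></div></dd>
<dd><a href="/press/pressconf/2012/html/other.en.html">Other</a></dd>
"""
links = statement_links(sample, "https://www.ecb.europa.eu/")
print(links)
```

In the loop above you would pass `resp.text` and `url` instead of the sample, and extend a list with the returned links for each year.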