How to web-scrape images that have no source?

Link: https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0

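This website has questions, in the form of images, that I need to scrape. However, I can't even get the links to their sources; my script outputs links to some loading GIF instead. When I look at the page source, there isn't even a "src" for any of the images. You can see how the site works at the link above. How can I download all of these images?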

from bs4 import BeautifulSoup
import requests

url = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"

# Fetch the static HTML; anything injected later by JavaScript is not included
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

images = soup.find_all('img')

for image in images:
    # .get() returns None instead of raising KeyError when an <img> has no src
    link = image.get('src')

    print(link)

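Since the page is dynamic, BeautifulSoup won't work here. You have to use Selenium:

  1. Navigate to the site.
  2. Get all the questions with the XPath //div/div[3]/center/table/tbody/tr/td[1]/center/a, then loop over them and click each one.
  3. Get the image source with the XPath //*[@id="question_prev"]/div[2]/img/@src, then fetch and save the image.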
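
The three steps above could be sketched roughly as follows. This is only an illustration, assuming Selenium 4 and a local Chrome install; the function names, the `pause` parameter, and the fixed sleep are hypothetical choices, not part of the answer. The XPaths are the ones listed above, except that the trailing /@src is dropped: Selenium locates elements rather than attributes, so the src is read with get_attribute instead.

```python
import time

# XPaths taken from the steps above (the image one without the trailing /@src)
QUESTION_XPATH = "//div/div[3]/center/table/tbody/tr/td[1]/center/a"
IMAGE_XPATH = '//*[@id="question_prev"]/div[2]/img'


def collect_image_urls(driver, pause=1.0):
    """Click every question link and return the preview image URLs in order."""
    urls = []
    # "xpath" is the string value of selenium's By.XPATH constant
    for link in driver.find_elements("xpath", QUESTION_XPATH):
        link.click()
        time.sleep(pause)  # crude wait for the preview to update (assumption)
        image = driver.find_element("xpath", IMAGE_XPATH)
        src = image.get_attribute("src")
        if src and src not in urls:  # skip empty values and duplicates
            urls.append(src)
    return urls


def run(url):
    """Open url in Chrome and print every image URL that was collected."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Chrome()
    try:
        driver.get(url)  # step 1: navigate to the site
        for src in collect_image_urls(driver):  # steps 2 and 3
            print(src)
    finally:
        driver.quit()
```

Calling run() with the link from the question would open a browser window and print whatever image URLs the preview pane exposes. If the page re-renders its table after each click, the link elements may go stale and would need to be located again inside the loop.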

The question IDs are embedded in the page itself, so you can try extracting them with the re (regex) module.

import re
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}

URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
BASE_URL = "https://www.exam-mate.com"

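# Send the browser-like user-agent defined above along with the request
soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")

# Links inside the first <td> of each table row
for tag in soup.select("td:nth-of-type(1) a"):
    # Find the question image path inside the link's onclick attribute
    match = re.search(r"/questions.*\.png", tag.get("onclick", ""))
    if match:  # skip links that carry no question image path
        print(BASE_URL + match.group())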

Output:

https://www.exam-mate.com/questions/1240/1362/1240_q_1362_1_1.png
https://www.exam-mate.com/questions/1240/1363/1240_q_1363_2_1.png
https://www.exam-mate.com/questions/1240/1364/1240_q_1364_3_1.png
https://www.exam-mate.com/questions/1240/1365/1240_q_1365_4_1.png
https://www.exam-mate.com/questions/1240/1366/1240_q_1366_5_1.png
... and so on
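
Since the question asks how to download the images rather than just list them, here is a minimal sketch of saving URLs like the ones printed above to disk. The helper names, the `questions` output folder, and the shortened user-agent string are assumptions made for illustration. It uses only the standard library so it runs on its own; `requests.get(url).content` would serve equally well as the fetch step.

```python
import os
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return os.path.basename(urlsplit(url).path)


def fetch_bytes(url, timeout=30):
    """Download one URL and return its raw bytes."""
    # Some sites reject requests without a browser-like user-agent (assumption)
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def save_images(urls, out_dir="questions", fetch=fetch_bytes):
    """Write every URL into out_dir and return the list of saved file paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(out_dir, filename_from_url(url))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

In the loop above, the printed links could be appended to a list and passed to save_images() once the loop finishes. The `fetch` parameter exists only so the download step can be swapped out, for example for a requests-based function or a stub in tests.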