How to web-scrape images which do not have a source?
This website has questions in the form of images that I need to scrape. But I can't even get a link to their source; it just outputs links to some loading GIF. When I look at the page source, there isn't even a "src" for any of the images. You can see how the site works at the link provided above. How can I download all of these images?
from bs4 import BeautifulSoup
import requests
import os
url = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.find_all('img')
for image in images:
    link = image['src']
    print(link)
Since the page is dynamic, BeautifulSoup will not work here. You have to use Selenium:
- Navigate to the site.
- Get all the questions with the XPath:
//div/div[3]/center/table/tbody/tr/td[1]/center/a
then loop over them and click each one.
- Get the image source with the XPath:
//*[@id="question_prev"]/div[2]/img/@src
then fetch and save the image. A minimal sketch of these steps is shown after this list.
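A minimal sketch of those steps, assuming Chrome and the XPaths quoted above; the check that the placeholder GIF's URL contains "loading" is an assumption and may need adjusting:

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
IMG_XPATH = '//*[@id="question_prev"]/div[2]/img'

driver = webdriver.Chrome()
driver.get(URL)
wait = WebDriverWait(driver, 10)

# Collect every question link on the page (XPath from the steps above).
questions = driver.find_elements(
    By.XPATH, "//div/div[3]/center/table/tbody/tr/td[1]/center/a")

for i, question in enumerate(questions):
    question.click()
    # The preview shows a loading GIF first; wait until the img src no longer
    # points at it (assumes the placeholder URL contains "loading").
    wait.until(lambda d: "loading" not in (
        d.find_element(By.XPATH, IMG_XPATH).get_attribute("src") or ""))
    src = driver.find_element(By.XPATH, IMG_XPATH).get_attribute("src")

    # Fetch and save the question image.
    with open(f"question_{i}.png", "wb") as f:
        f.write(requests.get(src).content)

driver.quit()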
The question IDs are embedded as part of the page, so you can use the re (regex) module to extract them.
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
BASE_URL = "https://www.exam-mate.com"
soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")

for tag in soup.select("td:nth-of-type(1) a"):
    # Extract the question image path embedded in the link's onclick attribute
    question_link = re.search(r"/questions.*\.png", tag["onclick"]).group()
    print(BASE_URL + question_link)
Output:
https://www.exam-mate.com/questions/1240/1362/1240_q_1362_1_1.png
https://www.exam-mate.com/questions/1240/1363/1240_q_1363_2_1.png
https://www.exam-mate.com/questions/1240/1364/1240_q_1364_3_1.png
https://www.exam-mate.com/questions/1240/1365/1240_q_1365_4_1.png
https://www.exam-mate.com/questions/1240/1366/1240_q_1366_5_1.png
...and so on.
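To go from the printed URLs to files on disk, the same approach can be extended to download each image. A self-contained sketch is below; the questions/ output folder name is arbitrary:

import os
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
BASE_URL = "https://www.exam-mate.com"

os.makedirs("questions", exist_ok=True)  # local output folder (name is arbitrary)

soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")

for tag in soup.select("td:nth-of-type(1) a"):
    match = re.search(r"/questions.*\.png", tag["onclick"])
    if not match:
        continue  # skip links that do not embed an image path
    image_url = BASE_URL + match.group()

    # Save the image under its original filename, e.g. 1240_q_1362_1_1.png
    filename = os.path.join("questions", image_url.rsplit("/", 1)[-1])
    response = requests.get(image_url, headers=headers)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)
    print("saved", filename)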