How to web-scrape PDFs that are hidden under a selection option?
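I am trying to download more than 100 PDFs from a website using Python. However, the PDFs are hidden under a selection option. For example:
- Option 1
- Option 2
- Option 3
...
Then, if I select option 1, I get:
- Option 1
  - Clickable link to info 1 [clickable link to file 1]
  - Clickable link to info 2 [clickable link to file 2]
  - Clickable link to info 3 [clickable link to file 3]
  - Clickable link to info 4 [clickable link to file 4]
  ...
- Option 2
- Option 3
Once I click, for example, "clickable link to file 1", an image pops up, with a "View PDF" option in the top-right corner of the popup. How can I loop over every file under option 1 and download its PDF? I am new to web scraping and would really appreciate any help.
Thanks!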
It looks like you can build the PDF URL automatically from the link identifier. For example:
import requests
from bs4 import BeautifulSoup

# Results page; the selector below picks the links that carry a data-dguid attribute
url = "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0"
# Template for the PDF URL, filled in from the link identifier
map_url = "https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/{id1}/{id2}.pdf"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("a[data-dguid]"):
    id_ = a["data-dguid"]
    # The slice id_[4:9] gives the folder name, e.g. "S0510"
    m = map_url.format(id1=id_[4:9], id2=id_)
    print("{:<60} {}".format(a["data-geoname"], m))
Prints:
...
Map: Arthur [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100022.pdf
Map: Atikokan [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100028.pdf
Map: Attawapiskat 91A [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101497.pdf
Map: Aylmer [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100030.pdf
Map: Ayr [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100031.pdf
Map: Azilda [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101498.pdf
Map: Ballantrae [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101370.pdf
Map: Barrie [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100043.pdf
Map: Barry's Bay [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100044.pdf
Map: Bath [Population center], Ontario https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101403.pdf
...
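To go from printing the URLs to actually saving the files, something along these lines might work. It is only a sketch: the helper names (`build_map_url`, `pdf_path`, `download_pdf`, `download_all`), the `maps` output folder and the 30-second timeout are my own assumptions, not part of the site or of the code above. I have used `urllib.request` from the standard library instead of `requests`, purely to keep the sketch free of dependencies; `requests.get(..., stream=True)` would do the same job.

```python
import shutil
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlopen

MAP_URL = "https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/{id1}/{id2}.pdf"


def build_map_url(dguid):
    """Build the PDF URL from a data-dguid value, using the same slice as above."""
    return MAP_URL.format(id1=dguid[4:9], id2=dguid)


def pdf_path(pdf_url, out_dir):
    """Local target path: out_dir plus the last segment of the URL path."""
    return Path(out_dir) / Path(urlsplit(pdf_url).path).name


def download_pdf(pdf_url, out_dir, opener=urlopen):
    """Copy one PDF to disk and return the path it was saved to.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be exercised without touching the network.
    """
    target = pdf_path(pdf_url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with opener(pdf_url, timeout=30) as resp, open(target, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return target


def download_all(dguids, out_dir="maps"):
    """Download the PDF for every identifier and return the saved paths."""
    return [download_pdf(build_map_url(d), out_dir) for d in dguids]
```

You would then collect the `data-dguid` values in the loop above and pass them on, e.g. `download_all(a["data-dguid"] for a in soup.select("a[data-dguid]"))`. With more than 100 files it is probably worth adding a short `time.sleep` between requests to go easy on the server.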