How to scrape PDFs to a local folder with filename = URL and a delay within the iteration?
I have scraped all the links containing .pdf, which are the ones that matter to me.
These are now stored in relative_paths:
['http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-1679.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/4444/jjjjj-99-9526.pdf',]
Now I want to store the PDFs "behind" those links in a local folder, with the filename being their URL.
None of the somewhat similar questions on the internet seem to have helped me reach my goal. The closest I got was when the code produced some strange files that did not even have an extension. Below are some of the more promising code samples I have already tried.
for link in relative_paths:
    content = requests.get(link, verify=False)
    with open(link, 'wb') as pdf:
        pdf.write(content.content)
for link in relative_paths:
    response = requests.get(link, verify=False)
    with open(join(r'C:/Users/', basename(link)), 'wb') as f:  # join, basename from os.path
        f.write(response.content)
for link in relative_paths:
    filename = link
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify=False).content)
for link in relative_paths:
    pdf_response = requests.get(link, verify=False)
    filename = link
    with open(filename, 'wb') as f:
        f.write(pdf_response.content)
Now I am lost and do not know how to move forward. Could you convert one of these for loops and give a short explanation? If the URL is too long to use as a filename, splitting at the third-to-last / would also be fine. Thanks :)
Also, the website host asked me not to scrape all the PDFs at once so the server does not get overloaded, because there are thousands of PDFs behind the many links stored in relative_paths. That is why I am looking for a way to build some kind of delay into my requests.
Give this a try:
import time
import requests

count_downloads = 25   # <--- pause after every 25 downloads
time_delay = 60        # <--- wait 60 seconds at each pause

for idx, link in enumerate(relative_paths):
    if idx % count_downloads == 0:
        print('Waiting %s seconds...' % time_delay)
        time.sleep(time_delay)
    filename = link.split('jjjjj-')[-1]  # <-- split on whatever marks the part of the URL you want to keep as the filename
    try:
        with open(filename, 'wb') as f:
            f.write(requests.get(link).content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
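Note that the split on 'jjjjj-' above is tied to these particular example URLs. As a more general alternative, here is a minimal sketch that derives the filename from the last path segment of each URL and pauses between every single request instead of in batches; target_dir and per_request_delay are assumed values, not anything from the question:

import os
import time
from urllib.parse import urlsplit

import requests

target_dir = r'C:/Users/pdfs'   # hypothetical local folder for the downloads
os.makedirs(target_dir, exist_ok=True)

per_request_delay = 2           # assumed pause (seconds) between single requests

for link in relative_paths:
    # keep only the last part of the URL path, e.g. 'jjjjj-99-0065.pdf';
    # to split at the third-to-last '/' instead, something like
    # '_'.join(urlsplit(link).path.split('/')[-3:]) would give 'iii_3333_jjjjj-99-0065.pdf'
    filename = os.path.basename(urlsplit(link).path)
    filepath = os.path.join(target_dir, filename)
    try:
        response = requests.get(link, verify=False)
        response.raise_for_status()          # fail early on 4xx/5xx responses
        with open(filepath, 'wb') as f:
            f.write(response.content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
    time.sleep(per_request_delay)            # be gentle with the server

Writing into a dedicated folder avoids the "strange files without an extension" problem, which comes from using the whole URL (slashes included) as the filename.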