How to scrape PDFs to a local folder with filename = URL and a delay within the iteration?
I have scraped all the links containing .pdf, which are the ones that matter to me.
These are now stored in relative_paths:
['http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-1679.pdf',
'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/4444/jjjjj-99-9526.pdf',]
Now I want to store the PDFs "behind" those links in a local folder, with the filename being their URL.
None of the somewhat similar questions on the internet seem to have helped me reach my goal. The closest I got was when the code produced some strange files that did not even have an extension. Below are some of the more promising code samples I have already tried.
for link in relative_paths:
    content = requests.get(link, verify=False)
    with open(link, 'wb') as pdf:
        pdf.write(content.content)
for link in relative_paths:
    response = requests.get(link, verify=False)
    with open(join(r'C:/Users/', basename(link)), 'wb') as f:  # join, basename from os.path
        f.write(response.content)
for link in relative_paths:
    filename = link
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify=False).content)
for link in relative_paths:
    pdf_response = requests.get(link, verify=False)
    filename = link
    with open(filename, 'wb') as f:
        f.write(pdf_response.content)
Now I am lost and do not know how to move forward. Could you convert one of these for loops and give a short explanation? If the URL is too long to use as a filename, splitting at the third-to-last / would also be fine. Thanks :)
Also, the website host asked me not to scrape all the PDFs at once so the server does not get overloaded, because there are thousands of PDFs behind the many links stored in relative_paths. That is why I am looking for a way to build some kind of delay into my requests.
Give this a try:
import time
import requests

count_downloads = 25   # <--- pause after every 25 downloads
time_delay = 60        # <--- wait 60 seconds at each pause

for idx, link in enumerate(relative_paths):
    if idx % count_downloads == 0:
        print('Waiting %s seconds...' % time_delay)
        time.sleep(time_delay)
    filename = link.split('jjjjj-')[-1]  # <-- split on whatever marks the part of the URL you want to keep as the filename
    try:
        with open(filename, 'wb') as f:
            f.write(requests.get(link).content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
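Note that the split on 'jjjjj-' above is tied to these particular example URLs. As a more general alternative, here is a minimal sketch that derives the filename from the last path segment of each URL and pauses between every single request instead of in batches; target_dir and per_request_delay are assumed values, not anything from the question:

import os
import time
from urllib.parse import urlsplit

import requests

target_dir = r'C:/Users/pdfs'   # hypothetical local folder for the downloads
os.makedirs(target_dir, exist_ok=True)

per_request_delay = 2           # assumed pause (seconds) between single requests

for link in relative_paths:
    # keep only the last part of the URL path, e.g. 'jjjjj-99-0065.pdf';
    # to split at the third-to-last '/' instead, something like
    # '_'.join(urlsplit(link).path.split('/')[-3:]) would give 'iii_3333_jjjjj-99-0065.pdf'
    filename = os.path.basename(urlsplit(link).path)
    filepath = os.path.join(target_dir, filename)
    try:
        response = requests.get(link, verify=False)
        response.raise_for_status()          # fail early on 4xx/5xx responses
        with open(filepath, 'wb') as f:
            f.write(response.content)
        print('Saved: %s' % link)
    except Exception as ex:
        print('%s not saved. %s' % (link, ex))
    time.sleep(per_request_delay)            # be gentle with the server

Writing into a dedicated folder avoids the "strange files without an extension" problem, which comes from using the whole URL (slashes included) as the filename.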