Downloading PDFs behind HTTPS with requests/BeautifulSoup won't work

I am trying to accomplish the following:

- find all .PDF files on a web page that requires a login
- rename the .PDF files to just the file name instead of the full URL
- create a folder on the local user's desktop
- only download files that do not already exist in that folder
- download the given .PDF files into the new folder

The code below logs into the website, retrieves all the .PDF links, splits each name at the slashes to keep just the file name, and downloads the files into a folder. However, all the downloaded files appear to be corrupted (they cannot be opened).

Any feedback or suggestions on how to fix this would be greatly appreciated. (The payload has been changed so as not to leak any credentials.)


Additional information:


Sampleurl is the site's main page after logging in. loginurl is the page where the user authenticates. secure_url is the page that contains all the .PDFs I want to download.
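A side note on building the absolute links (my suggestion, not part of the original question): the code below prepends Sampleurl by string concatenation, but urllib.parse.urljoin covers both cases in one call, leaving absolute hrefs untouched and resolving root-relative ones against the page they were scraped from:

from urllib.parse import urljoin

# assuming 'links' and 'secure_url' as in the code below:
# urljoin keeps 'https://...' hrefs as-is and resolves '/downloads/x.pdf'
# against the host of the page they came from
url_list = [urljoin(secure_url, el['href']) for el in links]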



Code:

# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()    

# Set folder location to local users desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl =('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')



payload = {
    'username': 'xxxx',
    'password': 'xxx',
    'ltfejs': 'xx'
    
}



  
with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))


    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    
    pprint(url_list)


    
    print("Downloading .PDF files")
        
    # download the pdfs to a specified location
    for url in url_list:
        print(url)
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        print(fullfilename)
        request.urlretrieve(Sampleurl,fullfilename)

     
            
print("This program will automatically close in 5 seconds ")
time.sleep(5)

Output

Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
 'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\quickscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\fullscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\improvementscan.pdf
https://www.tict.io/downloads/privacylabel.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\privacylabel.pdf
This program will automatically close in 5 seconds 

When I manually click one of the hyperlinks in the output, it downloads a valid .PDF.
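A quick way to diagnose this kind of "corruption" (my addition, not from the original post): a real PDF file starts with the magic bytes %PDF, while a saved login page starts with HTML. Checking either the server's Content-Type header or the first bytes of a saved file shows immediately whether the script stored a PDF or an HTML page under a .pdf name:

# inside the session: check what the server actually returns for one link
resp = s.get(url_list[0])
print(resp.headers.get('Content-Type'))   # expect 'application/pdf', not 'text/html'

# or inspect a file that was already saved to disk
with open(fullfilename, 'rb') as f:
    print(f.read(4) == b'%PDF')           # True for a real PDF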


EDIT

I have adjusted my code and it now downloads working PDFs into the assigned folder, but it only picks up the last file in the list and does not repeat the loop for the other files.

    print("Downloading .PDF files")
        
    # download the pdfs to a specified location
    for PDF in url_list:
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        myfile = requests.get(PDF) 
        open(fullfilename, 'wb').write(myfile.content)
        

print("This program will automaticly close in 5 seconds ")
time.sleep(5)

Only privacylabel.pdf (the last file in url_list) is downloaded. The others never appear in the folder. Printing PDF also only returns privacylabel.pdf.
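Reading that snippet, there is likely a second problem besides the missing session (my observation, not stated in the original post): the loop iterates over PDF, but the file name is built from url, a stale variable left over from the earlier download loop, so every iteration writes to the same file name. Building the name from the loop variable itself avoids that:

for PDF in url_list:
    # use the loop variable, not the stale 'url' from the earlier loop
    fullfilename = os.path.join(folder_location, PDF.split("/")[-1])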


WORKING

I forgot to call the session as s:

myfile = requests.get(PDF)

should be

myfile = s.get(PDF)
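The reason this matters: requests.Session keeps the cookies that the server sets during the login POST and sends them along with every subsequent request made through the same session. A bare requests.get() call has no access to those cookies, so the server treats it as an anonymous client and answers with a login page instead of the PDF. A minimal sketch of the difference, reusing loginurl and payload from above (pdf_url is a placeholder for one of the scraped links):

with requests.Session() as s:
    s.post(loginurl, data=payload)   # server sets an auth cookie on the session
    authed = s.get(pdf_url)          # cookie is sent along -> real PDF bytes

anonymous = requests.get(pdf_url)    # no cookies -> login page, not a PDF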

Working code for anyone interested:

# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re


# Fetch username
username = os.getlogin()    

# Set folder location to local users desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl =('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')


    

Username = input("Username: ")
Password = input("Password: ")

payload = {
    'username': Username,
    'password': Password,
    'ltfejs': 'xxx'
}

  
with requests.Session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # escape the dot so the pattern matches a literal ".pdf"
    links = soup.find_all('a', href=re.compile(r'\.pdf'))

    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    
    pprint(url_list)

   
    
    print("Downloading .PDF files")
    
# download the pdfs to a specified location
    for url in url_list:
        fullfilename = os.path.join(folder_location, url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        myfile = s.get(url)
        print(url)
        print("Myfile response:",myfile)
        open(fullfilename, 'wb').write(myfile.content)
                

print("This program will automaticly close in 5 seconds ")
time.sleep(5)

Output

Username: xxxx
Password: xxxx
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
 'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/downloads/privacylabel.pdf
Myfile response: <Response [200]>
This program will automatically close in 5 seconds 

Conclusion

  1. I had to call the session as s; because I had forgotten to do so, the script could not access the authenticated files.
  2. I had to change the download code slightly, because the original attempt downloaded with urlretrieve instead of requests.
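One item from the original task list is still missing from the working code: only downloading files that do not already exist in the folder. A minimal sketch of that check, combined with a streamed download so large PDFs are not held in memory all at once (this is my addition, not part of the original answer; url_list and folder_location are as defined above, and the loop runs inside the same with requests.Session() as s: block):

import os

os.makedirs(folder_location, exist_ok=True)  # create the folder once, up front

for url in url_list:
    fullfilename = os.path.join(folder_location, url.split("/")[-1])
    if os.path.exists(fullfilename):
        print("Skipping (already present):", fullfilename)
        continue
    with s.get(url, stream=True) as resp:
        resp.raise_for_status()  # fail loudly instead of saving an error page
        with open(fullfilename, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)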