Downloading PDFs behind HTTPS with requests/BeautifulSoup won't work
I am trying to accomplish the following:
- Find all .PDF files on a web page that requires a login
- Rename the .PDF files to just the filename instead of the full URL
- Create a folder on the local user's desktop
- Only download files that do not already exist in the created folder (see the sketch after this list)
- Download the given .PDF files into that new folder
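For the "only download files that don't already exist" point, a minimal sketch of the check I have in mind (assuming fullfilename is the target path as in the code below; download_pdf is a hypothetical placeholder for whatever performs the download):

import os

if not os.path.isfile(fullfilename):
    download_pdf(fullfilename)   # hypothetical helper for the actual download step
else:
    print("Skipping", fullfilename, "(already exists)")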
The code below logs in to the website, retrieves all .PDF links, slices each name down to just the filename, and downloads the files into the folder. However, all of the downloaded files appear to be corrupted (they cannot be opened).
Any feedback or suggestions on how to fix this would be greatly appreciated. (The payload has been changed so as not to leak any credentials.)
Additional information:
Sampleurl is the main page of the website after logging in.
loginurl is the page where the user authenticates.
secure_url is the page containing all the .PDFs I want to download.
Code:
# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()
# Set folder location to local user's desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = 'https://www.tict.io'
loginurl = 'https://www.tict.io/auth/login'
secure_url = 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca'

payload = {
    'username': 'xxxx',
    'password': 'xxx',
    'ltfejs': 'xx'
}

with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))
    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if el['href'].startswith('https:'):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    pprint(url_list)
    print("Downloading .PDF files")
    # download the pdfs to a specified location
    for url in url_list:
        print(url)
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        print(fullfilename)
        request.urlretrieve(Sampleurl, fullfilename)

print("This program will automatically close in 5 seconds ")
time.sleep(5)
Output
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\quickscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\fullscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\improvementscan.pdf
https://www.tict.io/downloads/privacylabel.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\privacylabel.pdf
This program will automatically close in 5 seconds
When I manually click one of the hyperlinks in the output, it downloads a valid .PDF.
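This suggests the links are fine and the downloads are the problem: request.urlretrieve(Sampleurl, fullfilename) fetches the homepage URL on every pass, and urlretrieve does not carry the session's login cookies, so each saved "PDF" is presumably an HTML page stored under a .pdf name. A quick way to check a downloaded file (a real PDF starts with the magic bytes %PDF):

with open(fullfilename, 'rb') as f:
    print(f.read(4))   # b'%PDF' for a real PDF; b'<!DO' or similar for an HTML page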
EDIT
I have adjusted my code, and it now does download working PDFs to the assigned folder; however, it only takes the last file in the list and does not repeat the loop for the other files.
print("Downloading .PDF files")
# download the pdfs to a specified location
for PDF in url_list:
fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
if not os.path.exists(folder_location):os.mkdir(folder_location)
myfile = requests.get(PDF)
open(fullfilename, 'wb').write(myfile.content)
print("This program will automaticly close in 5 seconds ")
time.sleep(5)
Only privacylabel.pdf (the last file in url_list) gets downloaded; the others never appear in the folder. When printing PDF it also only returns privacylabel.pdf.
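For reference, two things appear to be going on in that loop: fullfilename is built from the stale url variable left over from the earlier loop (so every response is written over privacylabel.pdf), and requests.get(PDF) bypasses the logged-in session. A sketch of the corrected loop (essentially what the working code below does):

for PDF in url_list:
    fullfilename = os.path.join(folder_location, PDF.split("/")[-1])   # name from the current URL
    myfile = s.get(PDF)   # reuse the authenticated session
    with open(fullfilename, 'wb') as f:
        f.write(myfile.content)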
WORKING
I forgot to call the session as s:
myfile = requests.get(PDF)
should be
myfile = s.get(PDF)
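The difference is that the Session object s stores the cookies set by the login POST and sends them with every later request, while module-level requests.get opens a fresh, unauthenticated connection. A minimal sketch of the distinction:

with requests.session() as s:
    s.post(loginurl, data=payload)   # login cookie is stored on s
    good = s.get(PDF)                # cookie sent automatically -> the real PDF
bad = requests.get(PDF)              # no cookie -> login page or redirect instead of the PDF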
Working code for anyone interested:
# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()
# Set folder location to local user's desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = 'https://www.tict.io'
loginurl = 'https://www.tict.io/auth/login'
secure_url = 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca'

Username = input("Username: ")
Password = input("Password: ")
payload = {
    'username': Username,
    'password': Password,
    'ltfejs': 'xxx'
}

with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # note: the '.' is unescaped, so this also matches hrefs like 'xpdf'; r'\.pdf' would be stricter
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))
    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if el['href'].startswith('https:'):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    pprint(url_list)
    print("Downloading .PDF files")
    # download the pdfs to a specified location
    for url in url_list:
        fullfilename = os.path.join(folder_location, url.split("/")[-1])
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        myfile = s.get(url)
        print(url)
        print("Myfile response:", myfile)
        open(fullfilename, 'wb').write(myfile.content)

print("This program will automatically close in 5 seconds ")
time.sleep(5)
Output
Username: xxxx
Password: xxxx
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/downloads/privacylabel.pdf
Myfile response: <Response [200]>
This program will automatically close in 5 seconds
Conclusion
- I had to call the session as s; because I had forgotten to do that, the files could not be accessed
- I had to change the download code slightly, because it originally tried to download with urlretrieve instead of requests
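A possible further hardening step (just a suggestion, not tested against this site): verify each response before writing it, so an expired session raises an error instead of silently saving an HTML page under a .pdf name:

myfile = s.get(url)
myfile.raise_for_status()   # stop on 4xx/5xx responses
if myfile.headers.get("Content-Type", "").startswith("application/pdf"):
    with open(fullfilename, 'wb') as f:
        f.write(myfile.content)
else:
    print("Skipping", url, "- response was not a PDF")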