How to download a file using Python requests when the file is served via a redirect?
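I'm trying to download a book from Fadedpage, like this one. If you click the link to the HTML file there, the browser displays the HTML file. The URL appears to be https://www.fadedpage.com/books/20170817/html.php. But if you try to download that URL by any of the usual means, you only get the metadata HTML, not the HTML containing the full text of the book. For instance, running wget https://www.fadedpage.com/books/20170817/html.php from the command line does return HTML, but it is again the metadata HTML file from https://www.fadedpage.com/showbook.php?pid=20170817, not the full text of the book.
Here's what I've tried so far: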
    import requests

    def downloadFile(bookID, fileType="html"):
        url = f"https://www.fadedpage.com/books/{bookID}/{fileType}.php"
        #url = f'https://www.fadedpage.com/link.php?file={bookID}.{fileType}'
        # Headers copied from the browser's request
        headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                   "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.3 Chrome/87.0.4280.144 Safari/537.36",
                   # f-string so that {bookID} is actually substituted
                   "referer": f"https://www.fadedpage.com/showbook.php?pid={bookID}",
                   "sec-fetch-dest": "document",
                   "sec-fetch-mode": "navigate",
                   "sec-fetch-site": "same-origin",
                   "sec-fetch-user": "?1",
                   "upgrade-insecure-requests": "1",
                   "cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"
                   }
        print("Getting ", url)
        resp = requests.get(url, headers=headers, cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"})
        if resp.ok:
            return resp.text
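I'm trying to give it the same headers my web browser sends, in the hope that it will return the same thing. But it doesn't work.
Is there something else I need to do to download this HTML file? Since it is served by PHP on the server side, I'm having a hard time reverse-engineering it.
For reference, the full HTML file contains a sentence beginning "The First Part of this book is intended for pupils" (it goes on to say they should be advanced enough to distinguish the parts of speech). That text is not in the metadata HTML file.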
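To see what the server is actually doing with such a request, it can help to print the redirect chain. The following is a minimal sketch, not a definitive diagnostic: `redirect_chain` and `fetch_chain` are hypothetical helper names, and the stand-in response at the bottom uses made-up hops (including the 302 status) purely for illustration. Only `resp.history`, `resp.url` and `resp.status_code` are real attributes of a requests response.

```python
from types import SimpleNamespace


def redirect_chain(resp):
    """Return [(status_code, url), ...] for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in resp.history]
    hops.append((resp.status_code, resp.url))
    return hops


def fetch_chain(url):
    """Fetch url over the network and return its redirect chain."""
    import requests  # third-party; imported here so redirect_chain() can be tried offline

    resp = requests.get(url, timeout=30)
    return redirect_chain(resp)


# Offline illustration with stand-in objects (hypothetical hops, not real server output).
fake = SimpleNamespace(
    status_code=200,
    url="https://www.fadedpage.com/showbook.php?pid=20170817",
    history=[
        SimpleNamespace(
            status_code=302,
            url="https://www.fadedpage.com/books/20170817/html.php",
        )
    ],
)

for status, hop_url in redirect_chain(fake):
    print(status, hop_url)
```

Calling `fetch_chain("https://www.fadedpage.com/books/20170817/html.php")` would then show whether the request ended at html.php or was sent on to showbook.php.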
Test
Here is a way to test it:
    def isValidDownload(bookID, fileType="html"):
        """
        A download of `downloadFile("20170817", "html")` should produce
        a file 20170817.html which contains the text "It was a woodland
        slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting
        the full text file.
        """
        with open(f"{bookID}.{fileType}") as f:
            raw = f.read()
        test = "woodland slope behind St. Pierre-les-Bains"
        return test in raw
This should return True, but it doesn't:
    downloadFile("20170817", "html")
    isValidDownload("20170817", "html")
    False
Trying again
A simpler version, based on the answer below, doesn't work either. Here it is in full:
    import requests

    def downloadFile(bookID, fileType):
        headers = {"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}
        url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
        print("Getting ", url)
        with requests.get(url, headers = headers) as resp:
            with open(f"{bookID}.{fileType}", 'wb') as f:
                f.write(resp.content)

    def isValidDownload(bookID, fileType="html"):
        """
        A download of `downloadFile("20170817", "html")` should produce
        a file 20170817.html which contains the text "It was a woodland
        slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting
        the full text file.
        """
        with open(f"{bookID}.{fileType}") as f:
            raw = f.read()
        test = "woodland slope behind St. Pierre-les-Bains"
        return test in raw

    downloadFile("20170817", "html")
    isValidDownload("20170817", "html")
That returns False.
- Pass cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"} instead of headers={"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}.
  This is because the requests library does headers.pop('Cookie', None) when it follows a redirect.
- Retry if resp.url is not f"https://www.fadedpage.com/books/{bookID}/{fileType}.php".
  This is because the server at first redirects a link.php request with a different bookID to showbook.php.
The download from downloadFile("20170817", "html") contains the text "The First Part of this book is intended for pupils", not "woodland slope behind St. Pierre-les-Bains", which is in the download from downloadFile("20130603", "html").
    import requests

    def downloadFile(bookID, fileType, retry=1):
        # Pass the session ID via cookies= so it survives redirects
        cookies = {"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"}
        url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
        print("Getting ", url)
        with requests.get(url, cookies=cookies) as resp:
            # Anything other than the book URL means we were sent elsewhere
            if resp.url != f"https://www.fadedpage.com/books/{bookID}/{fileType}.php":
                if retry:
                    return downloadFile(bookID, fileType, retry=retry-1)
                else:
                    raise Exception(f"Unexpected final URL: {resp.url}")
            with open(f"{bookID}.{fileType}", 'wb') as f:
                f.write(resp.content)

    def isValidDownload(bookID, fileType="html"):
        """
        A download of `downloadFile("20170817", "html")` should produce
        a file 20170817.html which contains the text "The First Part of
        this book is intended for pupils". If it doesn't, it isn't getting
        the full text file.
        """
        with open(f"{bookID}.{fileType}") as f:
            raw = f.read()
        test = ""
        if bookID == "20130603":
            test = "woodland slope behind St. Pierre-les-Bains"
        if bookID == "20170817":
            test = "The First Part of this book is intended for pupils"
        return test in raw
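The same retry idea can also be written as a loop around a requests.Session, which keeps whatever cookies the server sets and sends them again on redirects and on later requests. This is only a sketch under assumptions: it assumes the server issues its own PHPSESSID cookie so that no hard-coded session ID is needed, and `fetch_book`, `fetch_with_session`, `link_url` and `book_url` are hypothetical names. The `get` argument is any callable that behaves like `requests.get`, which lets the retry logic be exercised offline with a stand-in.

```python
from types import SimpleNamespace

BASE = "https://www.fadedpage.com"


def link_url(book_id, file_type):
    """URL of the link.php request used in the answer above."""
    return f"{BASE}/link.php?file={book_id}.{file_type}"


def book_url(book_id, file_type):
    """URL the request should end at when the full text is served."""
    return f"{BASE}/books/{book_id}/{file_type}.php"


def fetch_book(get, book_id, file_type, retries=1):
    """Request the link until the final URL is the book page; return the body bytes."""
    wanted = book_url(book_id, file_type)
    for _ in range(retries + 1):
        resp = get(link_url(book_id, file_type))
        if resp.url == wanted:
            return resp.content
    raise RuntimeError(f"request kept ending somewhere other than {wanted}")


def fetch_with_session(book_id, file_type, retries=1):
    """Real-network variant; assumes the server sets its own session cookie."""
    import requests  # third-party; imported here so the helpers above run without it

    with requests.Session() as session:
        return fetch_book(session.get, book_id, file_type, retries)


# Offline stand-in for requests.get: the first call ends at showbook.php,
# the second at the book page (made-up behaviour for illustration only).
calls = []


def fake_get(url):
    calls.append(url)
    if len(calls) == 1:
        final = f"{BASE}/showbook.php?pid=20170817"
    else:
        final = book_url("20170817", "html")
    return SimpleNamespace(url=final, content=b"full text")


body = fetch_book(fake_get, "20170817", "html")
```

A loop avoids the recursion in downloadFile and returns the bytes instead of writing a file, so the caller decides where to save them.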