How to download a file using Python requests, when that file is being served with a redirect?

Question

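I'm trying to download a book from Fadedpage, like this one. If you click the link to the HTML file there, it displays the HTML file. The URL appears to be https://www.fadedpage.com/books/20170817/html.php. But if you try to download that URL by any of the usual means, you only get the metadata HTML, not the HTML containing the full text of the book. For example, running wget https://www.fadedpage.com/books/20170817/html.php from the command line does return HTML, but it is again the metadata HTML file from https://www.fadedpage.com/showbook.php?pid=20170817, not the full text of the book.

Here's what I've tried so far: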

import requests

def downloadFile(bookID, fileType="html"):
    url = f"https://www.fadedpage.com/books/{bookID}/{fileType}.php"
    #url = f'https://www.fadedpage.com/link.php?file={bookID}.{fileType}'
    # Headers copied from the browser's own request
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
               "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.3 Chrome/87.0.4280.144 Safari/537.36",
               "referer": f"https://www.fadedpage.com/showbook.php?pid={bookID}",
               "sec-fetch-dest": "document",
               "sec-fetch-mode": "navigate",
               "sec-fetch-site": "same-origin",
               "sec-fetch-user": "?1",
               "upgrade-insecure-requests": "1",
               "cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"
              }
    print("Getting ", url)
    resp = requests.get(url, headers=headers, cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"})
    if resp.ok:
        return resp.text

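I'm trying to give it the same headers my web browser sends, in the hope that it will return the same thing. But it doesn't work.

Is there something else I need to do to download this HTML file? Since it's served by PHP on the server side, I'm having a hard time reverse-engineering it.

For reference, the full HTML file contains a sentence beginning "The First Part of this book is intended for pupils" (it goes on to say these are pupils advanced enough to distinguish the parts of speech), but that text is not in the metadata HTML file.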

Test

Here is another way to test it:

def isValidDownload(bookID, fileType="html"): 
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "It was a woodland 
    slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting 
    the full text file. 
    """
    with open(f"{bookID}.{fileType}") as f: 
        raw = f.read()
    test = "woodland slope behind St. Pierre-les-Bains"
    return test in raw

This should return True:

downloadFile("20170817", "html")
isValidDownload("20170817", "html")

False

Trying again

A simpler version based on the answer below doesn't work either. Here it is in full:

import requests

def downloadFile(bookID, fileType):
    # The session cookie is set by hand in the headers here
    headers = {"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}
    url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
    print("Getting ", url)
    with requests.get(url, headers=headers) as resp:
        with open(f"{bookID}.{fileType}", 'wb') as f:
            f.write(resp.content)

def isValidDownload(bookID, fileType="html"): 
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "It was a woodland 
    slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting 
    the full text file. 
    """
    with open(f"{bookID}.{fileType}") as f: 
        raw = f.read()
    test = "woodland slope behind St. Pierre-les-Bains"
    return test in raw

downloadFile("20170817", "html")
isValidDownload("20170817", "html")

That returns False.

Answer 1

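Pass cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"} instead of headers={"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}.
This is because the requests library calls headers.pop('Cookie', None) when it follows a redirect, so a cookie set by hand in the headers is dropped from the redirected request.

Retry if resp.url is not f"https://www.fadedpage.com/books/{bookID}/{fileType}.php".
This is because the server first redirects a link.php request with a different bookID to showbook.php.

The download from downloadFile("20170817", "html") contains the text "The First Part of this book is intended for pupils", not "woodland slope behind St. Pierre-les-Bains"; the latter is in the download from downloadFile("20130603", "html").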

import requests

def downloadFile(bookID, fileType, retry=1):
    # Pass the session cookie via cookies= so it is still sent after the redirect
    cookies = {"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"}
    url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
    print("Getting ", url)
    with requests.get(url, cookies=cookies) as resp:
        # Any other final URL means the redirect did not reach the book page
        if resp.url != f"https://www.fadedpage.com/books/{bookID}/{fileType}.php":
            if retry:
                return downloadFile(bookID, fileType, retry=retry-1)
            else:
                raise RuntimeError(f"Unexpected final URL: {resp.url}")
        with open(f"{bookID}.{fileType}", 'wb') as f:
            f.write(resp.content)
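
The check on resp.url can be paired with resp.history to see every hop a request took before it landed. The following is only a sketch: redirect_chain and landed_on_book are hypothetical helper names, not part of requests or of the code above, and the book URL pattern is simply the one used in this answer.

```python
import requests


def redirect_chain(resp):
    """List (status_code, url) for each intermediate hop, then the final response."""
    hops = [(r.status_code, r.url) for r in resp.history]
    return hops + [(resp.status_code, resp.url)]


def landed_on_book(resp, bookID, fileType):
    """True if the final URL is the book page itself (URL pattern assumed from the answer)."""
    return resp.url == f"https://www.fadedpage.com/books/{bookID}/{fileType}.php"


# Possible usage against the real site (needs network access and a valid session cookie):
#   resp = requests.get("https://www.fadedpage.com/link.php?file=20170817.html",
#                       cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"})
#   print(redirect_chain(resp))
#   print(landed_on_book(resp, "20170817", "html"))
```

Printing the chain when landed_on_book returns False shows whether the request ended up on showbook.php, which is the case the retry is meant to handle.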

def isValidDownload(bookID, fileType="html"):
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "The First Part of
    this book is intended for pupils". If it doesn't, it isn't getting
    the full text file.
    """
    with open(f"{bookID}.{fileType}") as f:
        raw = f.read()
    test = ""
    if bookID == "20130603":
        test = "woodland slope behind St. Pierre-les-Bains"
    if bookID == "20170817":
        test = "The First Part of this book is intended for pupils"
    return test in raw
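
The difference between headers={"cookie": ...} and cookies={...} across a redirect can be checked without touching Fadedpage. The sketch below starts a throwaway local server, which is an assumption made purely for illustration: /link redirects to /book, and /book echoes back whatever Cookie header it received. RedirectThenEcho and cookie_seen_after_redirect are hypothetical names, and the cookie value "abc" is a placeholder.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectThenEcho(BaseHTTPRequestHandler):
    """Stand-in server: /link redirects to /book, which echoes the Cookie header."""

    def do_GET(self):
        if self.path == "/link":
            self.send_response(302)
            self.send_header("Location", "/book")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = (self.headers.get("Cookie") or "").encode()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def cookie_seen_after_redirect(**request_kwargs):
    """Return the Cookie header carried by the final (post-redirect) request."""
    server = HTTPServer(("127.0.0.1", 0), RedirectThenEcho)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings for this local demo
            url = f"http://127.0.0.1:{server.server_port}/link"
            return session.get(url, timeout=5, **request_kwargs).text
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(repr(cookie_seen_after_redirect(headers={"Cookie": "PHPSESSID=abc"})))
    print(repr(cookie_seen_after_redirect(cookies={"PHPSESSID": "abc"})))
```

If requests behaves as described in this answer, the first call should come back empty because the hand-set header is removed on the redirect, while the second should come back as PHPSESSID=abc.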

python

cookies

redirect

http-headers

python-requests
