Why is the Python module requests downloading the HTML page instead of a file?
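I have an .xlsx file that I want to download with Python. If I click the following URL, https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx, the file downloads automatically without any problem. However, the following code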
import requests

url = "https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx"
with open("this_is_a_test.xlsx", "wb") as f:
    r = requests.get(url)
    f.write(r.content)
    print(r.ok)
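prints True and downloads an HTML page instead of the xlsx file. Even more frustrating, the same code used to work fine, but for some reason its behavior changed within the last 24 hours.
This thread and this thread discuss similar problems, but in both cases there was a login barrier, which does not exist in my case.
Edit 1: After running the code above and typing head this_is_a_test.xlsx in my terminal, this is the output I get: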
<!DOCTYPE html>
<html lang="en" class="pb-page" data-request-id="fe043004-5c5a-4d2e-a323-cc9b39aa3339"><head data-pb-dropzone="head"><meta name="pbContext" content=";wgroup:string:Publication Websites;page:string:Cookie Absent;website:website:aaas-site" />
<script>AAASdataLayer={"page":{"pageInfo":{"pageTitle":"","pageURL":"https://www.science.org/action/cookieAbsent"},"attributes":{}},"user":{}};if(AAASdataLayer&&AAASdataLayer.user){let match=document.cookie&&document.cookie.match(/(?:^|; )consent=([^;]*)/);if(match){let jsonObj=JSON.parse(decodeURIComponent(match[1]));AAASdataLayer.user.cookieConsent=jsonObj.Marketing?'true':'false';}}</script> <link type="text/css" rel="stylesheet" href="/pb-assets/css/local-1639500397097.css">
<title>AAAS</title>
<meta charset="UTF-8">
<meta name="robots" content="noarchive,noindex,nofollow" />
<meta property="og:title" content="AAAS" />
<meta property="og:type" content="Website" />
<meta property="og:site_name" content="AAAS" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1, user-scalable=0" />
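EDIT 2: OK, apparently the code downloaded the Excel file the first time it was executed, but changed its behavior on the second execution. The manual download (by clicking the link) still works. So I guess there might be a workaround?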
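You need to write in binary format instead. This person had a similar problem with urllib2. You can still use requests, as long as you write the binary output to the file.
A more Pythonic code example: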
import requests

dls = "https://www.example.com/important.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
    output.write(resp.content)
I tried it and had no problem reproducing the result.
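Since r.ok was True in the question even though an HTML page came back, it can help to inspect the bytes before trusting the download. The following is a minimal sketch, not part of the original answer: it relies on the fact that .xlsx files are ZIP archives and therefore start with the bytes PK\x03\x04. The function name looks_like_xlsx and its content_type parameter are hypothetical names chosen for illustration.

```python
# Local file header signature that every non-empty ZIP archive
# (and therefore every .xlsx file) starts with.
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_xlsx(content: bytes, content_type: str = "") -> bool:
    """Heuristic check that a response body is a ZIP-based file, not HTML.

    `content` is the raw body (e.g. r.content) and `content_type` is the
    value of the Content-Type response header, if available.
    """
    # An explicit HTML content type is a strong sign of an error or
    # consent page rather than the requested file.
    if "text/html" in content_type.lower():
        return False
    return content.startswith(ZIP_MAGIC)
```

With requests, this could be called as looks_like_xlsx(resp.content, resp.headers.get("Content-Type", "")) before writing the file. It is only a heuristic: it confirms that the body is a ZIP container, not that it is a valid workbook.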
One possible solution is to add a User-Agent header:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx"
with open("this_is_a_test.xlsx", "wb") as f:
    r = requests.get(url, headers=headers)
    f.write(r.content)
    print(r.ok)
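The HTML shown in the question mentions a "Cookie Absent" page (pageURL .../action/cookieAbsent), which suggests, though it does not prove, that the server expects cookies to be kept between requests. A possible variation on the snippet above is to send the User-Agent through a requests.Session, which stores cookies the server sets, and to refuse to write the file if HTML comes back. This is an untested sketch under that assumption; make_session and download are hypothetical helper names, and the site may still block scripted requests by other means.

```python
import requests

# Assumption: any browser-like User-Agent string will do; this one is
# copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)


def make_session(user_agent: str = BROWSER_UA) -> requests.Session:
    """Create a session that sends a User-Agent and keeps cookies."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def download(session: requests.Session, url: str, path: str, timeout: float = 30) -> None:
    """Fetch `url` and write it to `path`, unless the server returned HTML."""
    r = session.get(url, timeout=timeout)
    r.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    content_type = r.headers.get("Content-Type", "")
    if "text/html" in content_type.lower():
        # Do not create the file at all if we got a web page back.
        raise ValueError(f"Expected a file but got HTML from {r.url}")
    with open(path, "wb") as f:
        f.write(r.content)
```

For the URL in the question, the call would be download(make_session(), url, "this_is_a_test.xlsx"). Opening the output file only after the checks pass avoids leaving an HTML page saved under an .xlsx name.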