Download file with mechanicalsoup

Question

I want to download the Excel file on this ONS webpage using the MechanicalSoup package in Python. I have read the MechanicalSoup documentation and searched extensively for an example on Stack Overflow and elsewhere, without success.

My attempt is:

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

For the last line, I have also tried:

browser.download_link(link="https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",file="c:/test/filename.xls")

Update 25 January 2019: Thanks to AKX's comment below, I have also tried

browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))

In each case, I get the error:

mechanicalsoup.utils.LinkNotFoundError

However, the link does exist. Try pasting it into your address bar to confirm:

https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna

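What am I doing wrong?

Update 2, 25 January 2019: Thanks to AKX's answer below, here is the full MWE that answers my question (posted for anyone who runs into the same difficulty later):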

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup
import re

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link(link_text=".xls",file="c:/py/ONS_Data.xls" )

Answer 1

I haven't used MechanicalSoup, but looking at the documentation for download_link,

This function behaves similarly to follow_link()

and the documentation for follow_link says (emphasis mine)

If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.

If link doesn’t have a href-attribute or is None, treat link as a url_regex and look it up with find_link(). Any additional arguments specified are forwarded to this function.

Question marks (among other characters) are regular expression (regex) metacharacters, so you need to escape them if you want to use them with follow_link/download_link:

import re
# ...
browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))
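To see why escaping matters, here is a minimal sketch using only the standard re module. It assumes the url_regex is applied to each href with re.search, and it matches the URL from the question against itself, once unescaped and once escaped.

```python
import re

url = "https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"

# Unescaped, "r?" means "an optional r", so the pattern never consumes
# the literal "?" in the string and the search fails.
unescaped_match = re.search(url, url)

# re.escape() backslash-escapes the metacharacters, so the pattern
# matches the string character for character.
escaped_match = re.search(re.escape(url), url)

print(unescaped_match)             # None
print(escaped_match is not None)   # True
```

So the unescaped URL cannot even match itself, which is one way to end up with LinkNotFoundError.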

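However, if the first page you visit doesn't contain that direct link, I'm not sure escaping will help. (Try it first, though.)

Alternatively, you can download the file directly through the browser's underlying requests session, which presumably carries the cookie jar (assuming the download needs some cookies):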

resp = browser.session.get("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
resp.raise_for_status()  # raise an exception for 404, etc.
with open('filename.xls', 'wb') as outf:
  outf.write(resp.content)
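If you do this in more than one place, the same steps can be wrapped in a small helper. This is only a sketch: save_download is a hypothetical name, and the Content-Type check is my own addition, on the assumption that an HTML body means an error or login page came back instead of the spreadsheet.

```python
from pathlib import Path


def save_download(session, url, path):
    """Fetch url with a requests-style session and write the body to path."""
    resp = session.get(url)
    resp.raise_for_status()  # raise an exception for 404, etc.

    # Assumption: a real file download is not served as text/html.
    content_type = resp.headers.get("Content-Type", "")
    if content_type.startswith("text/html"):
        raise ValueError(f"expected a file, got {content_type!r}")

    target = Path(path)
    target.write_bytes(resp.content)
    return target
```

With the browser from the question, the call would look like save_download(browser.session, "https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna", "filename.xls").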

Answer 2

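You are confusing a link (an element in the web page, like <a href=... >) with a URL (a string of the form http://example.com). MechanicalSoup's follow_link looks for a link in the page and follows it, just as if you had clicked it in a browser.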
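The distinction can be sketched with the standard library alone. Everything here is illustrative: the PAGE fragment is hypothetical (the real ONS markup may differ, in particular whether the href is relative), and find_links is a stand-in that mimics, as far as I understand it, how a link lookup filters anchors: a regex search on the href and an exact comparison on the link text.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment with a relative href; not the real ONS markup.
PAGE = (
    '<p><a href="/generator?format=xls&amp;uri=/economy/x">.xls</a> '
    '<a href="/help">Help</a></p>'
)


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_links(html, url_regex=None, link_text=None):
    """Return the anchors whose href matches url_regex and whose text equals link_text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    found = parser.links
    if url_regex is not None:
        found = [link for link in found if re.search(url_regex, link[0])]
    if link_text is not None:
        found = [link for link in found if link[1] == link_text]
    return found


# The absolute URL, even escaped, is not what the page's href contains.
absolute = re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/x")
print(find_links(PAGE, url_regex=absolute))       # []

# Searching on part of the href, or on the visible link text, finds the element.
print(find_links(PAGE, url_regex=r"format=xls"))
print(find_links(PAGE, link_text=".xls"))
```

Under these assumptions, a lookup by absolute URL finds nothing while a lookup by link text succeeds, which is consistent with the working call in the question's second update.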
