Download file with mechanicalsoup

I want to download the Excel file on this ONS webpage using the MechanicalSoup package in Python. I have read the MechanicalSoup documentation and searched extensively for an example on Stack Overflow and elsewhere, without success.

My attempt is:

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

For the last line, I have also tried:

browser.download_link(link="https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",file="c:/test/filename.xls")

Update, 25 January 2019: Thanks to AKX's comment below, I have tried

browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))

In each case, I get the error message:

mechanicalsoup.utils.LinkNotFoundError

However, the link does exist. Try pasting it into your address bar to confirm:

https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna

What am I doing wrong?

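Update 2, 25 January 2019: Thanks to AKX's answer below, here is a full MWE that answers my question (posted for anyone who runs into the same difficulty in future):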

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup
import re

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link(link_text=".xls", file="c:/py/ONS_Data.xls")

I haven't used MechanicalSoup, but looking at the documentation for download_link,

This function behaves similarly to follow_link()

and follow_link says (emphasis mine):

  • If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.
  • If link doesn’t have a href-attribute or is None, treat link as a url_regex and look it up with find_link(). Any additional arguments specified are forwarded to this function.

Question marks (among other characters) are regular expression (regex) metacharacters, so you need to escape them if you want to use them with follow_link/download_link:

import re
# ...
browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))

However, I'm not sure this helps if the first page you visit doesn't contain that direct link. (Try it first, though.)
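To see why escaping matters, here is a minimal sketch that uses only Python's re module, with no MechanicalSoup involved. It treats the URL from the question as a regex pattern and searches the same URL string with it, once unescaped and once escaped. The variable names are illustrative only.

```python
import re

url = ("https://www.ons.gov.uk/generator?format=xls"
       "&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

# Unescaped: "r?" means "an optional r", so the pattern expects "format"
# directly after "generator" and can never match the literal "?".
unescaped_match = re.search(url, url)

# Escaped: every metacharacter is matched literally.
escaped_match = re.search(re.escape(url), url)

print("unescaped pattern matches:", unescaped_match is not None)
print("escaped pattern matches:", escaped_match is not None)
```

Escaping only makes the pattern literal. It still has to match the href of an `<a>` element that is actually present on the currently open page.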

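You can use the browser's underlying requests session, which presumably carries the cookie jar (assuming the download needs some cookies), to download the file directly: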

resp = browser.session.get("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
resp.raise_for_status()  # raise an exception for 404, etc.
with open('filename.xls', 'wb') as outf:
  outf.write(resp.content)

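You are confusing a link (an element in a web page, such as <a href=... >) with a URL (a string of the form http://example.com). MechanicalSoup's follow_link looks for a link in the page and follows it, just as if you had clicked it in your browser.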
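The difference can be sketched with the standard library alone. The block below is a simplified stand-in for a link lookup, not MechanicalSoup's actual implementation. The HTML fragment, the LinkCollector class and the find_links helper are all hypothetical, and the real ONS markup may differ. It assumes the page stores the download target as a relative href, which would explain why searching for the full absolute URL finds nothing while searching by link text succeeds.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; the real ONS markup may differ.
PAGE = """
<p>Download this time series:
  <a href="/generator?format=csv&amp;uri=/example/series">.csv</a>
  <a href="/generator?format=xls&amp;uri=/example/series">.xls</a>
</p>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_links(html, url_regex=None, link_text=None):
    """Return links whose href matches url_regex and whose text equals link_text."""
    collector = LinkCollector()
    collector.feed(html)
    found = collector.links
    if url_regex is not None:
        found = [link for link in found if re.search(url_regex, link[0])]
    if link_text is not None:
        found = [link for link in found if link[1] == link_text]
    return found


by_text = find_links(PAGE, link_text=".xls")
by_regex = find_links(PAGE, url_regex=re.escape("format=xls"))
by_absolute_url = find_links(
    PAGE, url_regex=re.escape("https://www.ons.gov.uk/generator?format=xls")
)

print("by link text:", by_text)
print("by href regex:", by_regex)
print("by absolute URL:", by_absolute_url)
```

In this sketch the lookup by link text and the lookup by a partial href pattern both find the .xls link, while the absolute URL matches no element. That is consistent with the LinkNotFoundError in the question and with the working `link_text=".xls"` call.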