Download a file with MechanicalSoup
I want to download an Excel file from this ONS webpage using the MechanicalSoup package in Python. I have read the MechanicalSoup documentation, and I have searched extensively for an example on Stack Overflow and elsewhere, without success.
My attempt is:
# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup
# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup
# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
browser.download_link("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
For the last line, I have also tried:
browser.download_link(link="https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",file="c:/test/filename.xls")
Update, 25 January 2019: Thanks to AKX's comment below, I have also tried
browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))
In each case, I get the error:
mechanicalsoup.utils.LinkNotFoundError
Yet the link does exist. Try pasting it into your address bar to confirm:
https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna
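What am I doing wrong?
Update 2, 25 January 2019: Thanks to AKX's answer below, here is the full MWE that answers my question (posted for anyone who runs into the same difficulty in future):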
# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup
# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup
import re
# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
browser.download_link(link_text=".xls",file="c:/py/ONS_Data.xls" )
I haven't used MechanicalSoup, but looking at the documentation, download_link says
This function behaves similarly to follow_link()
and follow_link says (emphasis mine)
- If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.
- If link doesn’t have a href-attribute or is None, treat link as a url_regex and look it up with find_link(). Any additional arguments specified are forwarded to this function.
The question mark (among others) is a regular expression (regex) metacharacter, so if you want to pass a URL containing one to follow_link/download_link, you need to escape it:
import re
# ...
browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))
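To see why the escaping matters, here is a minimal standard-library sketch (it does not touch MechanicalSoup or the network). It uses the URL from the question as a regex pattern against itself, first raw and then passed through re.escape. The variable names are only illustrative.

```python
import re

url = ("https://www.ons.gov.uk/generator?format=xls"
       "&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

# As a pattern, "generator?" means "generato" plus an optional "r",
# so the literal "?" in the text is never consumed and the search fails.
raw_match = re.search(url, url)

# re.escape() backslash-escapes the metacharacters, making the pattern literal.
escaped_match = re.search(re.escape(url), url)

print(raw_match is None)          # the raw URL does not match itself
print(escaped_match is not None)  # the escaped URL does
```

Both lines print True: the unescaped URL fails to match its own text, while the escaped version matches.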
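However, if the first page you visit doesn't contain that direct link, I'm not sure it will help. (Try it first, though.)
You could instead use the browser's underlying requests session, which should carry the cookie jar (in case the download needs some cookies), to download the file directly: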
resp = browser.session.get("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
resp.raise_for_status() # raise an exception for 404, etc.
with open('filename.xls', 'wb') as outf:
    outf.write(resp.content)
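You are confusing a link (an element in a web page, like <a href=... >) with a URL (a string of the form http://example.com). MechanicalSoup's follow_link looks for a link in the page and follows it, just as if you had clicked on it in your browser.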