Python3 requests-html: Unfortunately, automated access to this page was denied
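Hello Stack Overflow community,
A few months ago I built a scraper with Python 3, requests-html, and BeautifulSoup to scrape car ads from https://www.mobile.de. The scraper uses the following search URL to get a list of all available car ads and then iterates over the detail pages.
Please find the code below: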
from bs4 import BeautifulSoup, SoupStrainer
from requests_html import HTMLSession
import re
url = 'https://suchen.mobile.de/fahrzeuge/search.html?&damageUnrepaired=NO_DAMAGE_UNREPAIRED&isSearchRequest=true&makeModelVariant1.makeId=25200&makeModelVariant1.modelId=g29&scopeId=C&sfmr=false'
session = HTMLSession()
r = session.get(url)
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(r.content,'lxml', parse_only=only_a_tags)
# Print every link that points to a car ad detail page
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://suchen\.mobile\.de/fahrzeuge/details\.html")}):
    print(link.get("href"))
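For a few days now, the scraper has no longer been able to scrape car ads from the website. When it iterates over all the a tags to collect the detail pages of the car ads (their URLs always start with https://suchen.mobile.de/fahrzeuge/details.html), it currently finds no results. Previously, the links to the detail pages were printed.
When I print the HTML content, all I get is the following error page: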
b'<!DOCTYPE html>\n<html>\n <!--\nLeider koennen wir Dir an dieser Stelle keinen Zugriff auf unsere Daten gewaehren.\nSolltest Du weiterhin Interesse an einem Bezug unserer Daten haben, wende Dich bitte an:\n\nUnfortunately, automated access to this page was denied.\nIf you are interested in accessing our data, please contact us:\n\nPhone:\n+49 (0) 30 8109-7573\n\nMail:\nDatenpartner@team.mobile.de\n -->\n <head>\n <meta charset="UTF-8">\n\n <title>Ups, bist Du ein Mensch? / Are you a human?</title>\n <link rel="stylesheet" href="https://static.classistatic.de/shared/mde-style/2.1.0/style.css">\n <link rel="icon" type="image/x-icon" href="data:image/x-icon;base64,AAABAAEAEBAAAAAAAABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbJ7yGTV+7Msdbuv/hKzr6/347tLw8O/S8vLy0vPz89Ly8vLS8PDw0vDw79Lt7e3S6urqnNDQ0AQAAAAAXpbwGih27MEfcOv/DWbr/4iv7P//+vD/8vLy//X19f/29vb/9fX1//Pz8//y8vL/7+/v/+3t7f3q6em05OTkGTR97c4fcOv/H3Dr/w5m6/+Ir+z///rw//Ly8v/19fX/9vb2//X19f/z8/P/8vLy/+/v7//t7e3/6urq/+fn5s8hcev/H3Dr/x9w6/8OZuv/iK/s///67/////////////X19f////////////Hx8f/6+vr//v7+/+vr6//n5+f/H3Dr/x9w6/8fcOv/Dmbr/4iw7f///fX/nJ2d/5mamf/6+vr/pqen/4+Qj//4+Pj/ra6u/4WGhv/m5ub/6urq/x9w6/8fcOv/H3Dr/w5m6/+IsO3////8/y0vL/8iJCT//////0FCQv8OEBD//////1laWv8AAgL/4eHh/+3t7f8fcOv/H3Dr/x9w6/8OZuv/iLDt////+/88Pj7/MjQ0//////9PUFD/HyEh//////9kZWX/EhQU/+Li4v/t7e3/H3Dr/x9w6/8fcOv/Dmbr/4iw7f////v/PD4+/zI0NP//////T1BQ/x8hIf//////ZGVl/xIUFP/i4uL/7e3t/x9w6/8fcOv/H3Dr/w5m6/+IsO3////7/zw+Pv80Njb//////1JTU/8gIiL//////2hpaf8RExP/4uLi/+3t7f8fcOv/H3Dr/x9w6/8OZuv/iLDt////+/8/QED/ERMT/8XGxv8/QUH/BggI/7+/v/9KTEz/ExUV/+Tk5P/t7Oz/H3Dr/x9w6/8fcOv/Dmbr/4iw7f////r/SktL/y0vL/8oKir/AgMD/2FiYv82Nzf/AAAA/2BhYf/19fT/6Ojo/x9w6/8fcOv/H3Dr/w5m6/+Ir+z///rx/+jo6P/y8vL/7O3t/87Ozv/8/Pz/7e3t/8nJyf/v7u7/7Ozs/+fn5/8fcOv/H3Dr/x9w6/8OZuv/iK/s///68P/19fX/9/f3//n5+f/9/f3/9PT0//X19f/39/f/7u7u/+rq6v/n5+f/MXvs4x5w6/8fcOv/Dmbr/4iv7P//+vD/8vLy//X19f/29vb/9fX1//Pz8//y8vL/7+/v/+3t7f/q6ur/5+fmwU2M7yUoduzGH3Dr/w5m6/+Ir+z///rw//Ly8v/19fX/9vb2//X1
9f/z8/P/8vLy/+/v7//t7e3/6urpu+Pj4hMAAAAAc6PyGDN87McRaOv/hq7r//347f/v7+//8fHx//Ly8v/x8fH/8PDw/+/v7//t7e386+vqtuLi4RAAAAAAwAMAAIABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAQAAwAMAAA==">\n <script src=\'https://www.google.com/recaptcha/api.js\'></script>\n </head>\n <body>\n <header id="mdeHeader" class="header">\n <div class="header-meta-container header-hidden-small">\n <!-- placeholder for desktop meta -->\n </div>\n <div class="header-navbar clearfix">\n <div class="header-corporate">\n <a href="//www.mobile.de"><i class="gicon-mobilede-logo"></i></a>\n <span class="claim header-hidden-small">Deutschlands gr\xc3\xb6\xc3\x9fter Fahrzeugmarkt</span>\n </div>\n </div>\n </header>\n <div class="g-container">\n <h2 class="u-pad-bottom-18 u-margin-top-18">Ups, bist Du ein Mensch? / Are you a human?</h2>\n\n\n <div id="root"></div>\n <div class="cBox cBox--content">\n <p><b>\n Um fortzufahren muss dein Browser Cookies unterst\xc3\xbctzen und JavaScript aktiviert sein.<br>\n To continue your browser has to accept cookies and has to have JavaScript enabled.</b>\n </p>\n\n <p>\n Bei Problemen wende Dich bitte an:<br>\n In case of problems please contact:\n </p>\n <p>\n Phone: 030 81097-601<br>\n Mail: service@team.mobile.de\n </p>\n\n <p>\n Sollte grunds\xc3\xa4tzliches Interesse am Bezug von mobile.de Daten bestehen, wende Dich bitte an:<br/>\n If you are primarily interested in purchasing data from mobile.de, please contact:\n </p>\n <p>\n Mail: Datenpartner@team.mobile.de\n </p>\n </div>\n <hr class="u-pad-top-9 u-pad-bottom-18"/>\n <div id="footer"></div>\n <script async src="https://www.mobile.de/api/consent/static/js/consentBanner.js"></script>\n <script type="text/javascript" src="https://www.mobile.de/youre-blocked/app.js"></script><script type="text/javascript" >var _cf = _cf || []; _cf.push([\'_setFsp\', true]); _cf.push([\'_setBm\', true]); _cf.push([\'_setAu\', \'/static/16b9372bb8fti233b6fc758bf7a4291f0\']); </script><script 
type="text/javascript" src="/static/16b9372bb8fti233b6fc758bf7a4291f0"></script></body>\n</html>\n'
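A response like the one above can be recognised in code before any parsing is attempted, which makes an empty link list easier to diagnose. The following is only a minimal sketch: the helper name `is_block_page` and the choice of marker strings are assumptions for illustration, with the markers copied from the page dump above.

```python
# Marker strings taken from the block page shown above. Which strings to
# check for is an assumption; the site may change its wording at any time.
BLOCK_MARKERS = (
    b"automated access to this page was denied",
    b"Are you a human?",
)


def is_block_page(content: bytes) -> bool:
    """Return True if the raw response body looks like the bot-check page."""
    return any(marker in content for marker in BLOCK_MARKERS)
```

In the scraper from the question this could be called as `is_block_page(r.content)` right after `session.get(url)`, so that a blocked request is reported instead of silently yielding no links.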
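When I first built the scraper, I also got the "Unfortunately, automated access to this page was denied." message while using urllib, so I switched to requests-html and everything worked fine.
I have already tried the following to fix it, but so far none of it has worked :(
- Proxy rotation (I thought my IP address might have been blocked)
- Different User-Agent headers via the fake_useragent library
I hope you can help, because I currently have no idea what else I could try.
Thank you very much for helping me with this problem :)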
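First navigate to the search page with Selenium WebDriver, then run your query from there.
I just got the same message on my own machine. When I visit the site manually, I also get a reCAPTCHA. Even opening the URL directly with Selenium produces a reCAPTCHA.
If I wanted to defeat you, I would require a reCAPTCHA whenever someone connects directly to the search results. That is my guess as to how you are being blocked. When I used WebDriver to navigate to the search page first, I was not challenged.
Here is the code I used.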
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# implicitly_wait takes seconds, and one call applies to every later lookup.
# Not good practice, but quick and easy.
driver.implicitly_wait(5)
driver.get("https://suchen.mobile.de/fahrzeuge/search.html")
# Accept the GDPR consent banner
driver.find_element(By.ID, "gdpr-consent-accept-button").click()
# Select petrol as the fuel type
driver.find_element(By.ID, "fuels-PETROL-ds").click()
# Submit the search
driver.find_element(By.ID, "dsp-upper-search-btn").click()
This won't work forever, but at least it works for now.
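Once the browser has reached the results page, the link filter from the question can be applied to the page source that Selenium exposes. The sketch below is an assumption about how the two pieces might be joined, not part of the original answer: it uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup, and the names `DetailLinkParser` and `extract_detail_links` are hypothetical.

```python
from html.parser import HTMLParser

# URL prefix of the car ad detail pages, as given in the question
DETAILS_PREFIX = "https://suchen.mobile.de/fahrzeuge/details.html"


class DetailLinkParser(HTMLParser):
    """Collects href values of <a> tags that point to detail pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links to other pages
        if href and href.startswith(DETAILS_PREFIX):
            self.links.append(href)


def extract_detail_links(html):
    """Return all detail-page links found in an HTML string, in document order."""
    parser = DetailLinkParser()
    parser.feed(html)
    return parser.links
```

With the Selenium session above, this would be called as `extract_detail_links(driver.page_source)` after the search results have loaded.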