Python3 requests-html: Unfortunately, automated access to this page was denied

Question

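Hello Stack Overflow community,

A few months ago, I built a scraper with Python 3, requests-html, and BeautifulSoup to scrape car ads from https://www.mobile.de. The scraper uses the following search URL to list all available car ads and then iterates over the detail pages.

Please find the code below: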

from bs4 import BeautifulSoup, SoupStrainer
from requests_html import HTMLSession
import re

url = 'https://suchen.mobile.de/fahrzeuge/search.html?&damageUnrepaired=NO_DAMAGE_UNREPAIRED&isSearchRequest=true&makeModelVariant1.makeId=25200&makeModelVariant1.modelId=g29&scopeId=C&sfmr=false'

session = HTMLSession()
r = session.get(url)
# Parse only <a> tags, since only the links are needed
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(r.content, 'lxml', parse_only=only_a_tags)
# Print every link that points to a car ad detail page
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://suchen\.mobile\.de/fahrzeuge/details\.html")}):
    print(link.get("href"))

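A few days ago, the scraper stopped being able to scrape car ads from the site. When iterating over all <a> tags to get the detail pages of the car ads (whose URLs always start with https://suchen.mobile.de/fahrzeuge/details.html), no results are shown at the moment. Previously, the links to the car ad detail pages were printed. When I print the HTML content, all I get is the following error message: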

b'<!DOCTYPE html>\n<html>\n  <!--\nLeider koennen wir Dir an dieser Stelle keinen Zugriff auf unsere Daten gewaehren.\nSolltest Du weiterhin Interesse an einem Bezug unserer Daten haben, wende Dich bitte an:\n\nUnfortunately, automated access to this page was denied.\nIf you are interested in accessing our data, please contact us:\n\nPhone:\n+49 (0) 30 8109-7573\n\nMail:\nDatenpartner@team.mobile.de\n  -->\n  <head>\n    <meta charset="UTF-8">\n\n    <title>Ups, bist Du ein Mensch? / Are you a human?</title>\n        <link rel="stylesheet" href="https://static.classistatic.de/shared/mde-style/2.1.0/style.css">\n    <link rel="icon" type="image/x-icon" href="data:image/x-icon;base64,AAABAAEAEBAAAAAAAABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbJ7yGTV+7Msdbuv/hKzr6/347tLw8O/S8vLy0vPz89Ly8vLS8PDw0vDw79Lt7e3S6urqnNDQ0AQAAAAAXpbwGih27MEfcOv/DWbr/4iv7P//+vD/8vLy//X19f/29vb/9fX1//Pz8//y8vL/7+/v/+3t7f3q6em05OTkGTR97c4fcOv/H3Dr/w5m6/+Ir+z///rw//Ly8v/19fX/9vb2//X19f/z8/P/8vLy/+/v7//t7e3/6urq/+fn5s8hcev/H3Dr/x9w6/8OZuv/iK/s///67/////////////X19f////////////Hx8f/6+vr//v7+/+vr6//n5+f/H3Dr/x9w6/8fcOv/Dmbr/4iw7f///fX/nJ2d/5mamf/6+vr/pqen/4+Qj//4+Pj/ra6u/4WGhv/m5ub/6urq/x9w6/8fcOv/H3Dr/w5m6/+IsO3////8/y0vL/8iJCT//////0FCQv8OEBD//////1laWv8AAgL/4eHh/+3t7f8fcOv/H3Dr/x9w6/8OZuv/iLDt////+/88Pj7/MjQ0//////9PUFD/HyEh//////9kZWX/EhQU/+Li4v/t7e3/H3Dr/x9w6/8fcOv/Dmbr/4iw7f////v/PD4+/zI0NP//////T1BQ/x8hIf//////ZGVl/xIUFP/i4uL/7e3t/x9w6/8fcOv/H3Dr/w5m6/+IsO3////7/zw+Pv80Njb//////1JTU/8gIiL//////2hpaf8RExP/4uLi/+3t7f8fcOv/H3Dr/x9w6/8OZuv/iLDt////+/8/QED/ERMT/8XGxv8/QUH/BggI/7+/v/9KTEz/ExUV/+Tk5P/t7Oz/H3Dr/x9w6/8fcOv/Dmbr/4iw7f////r/SktL/y0vL/8oKir/AgMD/2FiYv82Nzf/AAAA/2BhYf/19fT/6Ojo/x9w6/8fcOv/H3Dr/w5m6/+Ir+z///rx/+jo6P/y8vL/7O3t/87Ozv/8/Pz/7e3t/8nJyf/v7u7/7Ozs/+fn5/8fcOv/H3Dr/x9w6/8OZuv/iK/s///68P/19fX/9/f3//n5+f/9/f3/9PT0//X19f/39/f/7u7u/+rq6v/n5+f/MXvs4x5w6/8fcOv/Dmbr/4iv7P//+vD/8vLy//X19f/29vb/9fX1//Pz8//y8vL/7+/v/+3t7f/q6ur/5+fmwU2M7yUoduzGH3Dr/w5m6/+Ir+z///rw/
/Ly8v/19fX/9vb2//X19f/z8/P/8vLy/+/v7//t7e3/6urpu+Pj4hMAAAAAc6PyGDN87McRaOv/hq7r//347f/v7+//8fHx//Ly8v/x8fH/8PDw/+/v7//t7e386+vqtuLi4RAAAAAAwAMAAIABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAQAAwAMAAA==">\n    <script src=\'https://www.google.com/recaptcha/api.js\'></script>\n  </head>\n  <body>\n    <header id="mdeHeader" class="header">\n      <div class="header-meta-container header-hidden-small">\n        <!-- placeholder for desktop meta -->\n      </div>\n      <div class="header-navbar clearfix">\n        <div class="header-corporate">\n          <a href="//www.mobile.de"><i class="gicon-mobilede-logo"></i></a>\n          <span class="claim header-hidden-small">Deutschlands gr\xc3\xb6\xc3\x9fter Fahrzeugmarkt</span>\n        </div>\n      </div>\n    </header>\n  <div class="g-container">\n    <h2 class="u-pad-bottom-18 u-margin-top-18">Ups, bist Du ein Mensch? / Are you a human?</h2>\n\n\n    <div id="root"></div>\n    <div class="cBox cBox--content">\n      <p><b>\n        Um fortzufahren muss dein Browser Cookies unterst\xc3\xbctzen und JavaScript aktiviert sein.<br>\n        To continue your browser has to accept cookies and has to have JavaScript enabled.</b>\n      </p>\n\n      <p>\n        Bei Problemen wende Dich bitte an:<br>\n        In case of problems please contact:\n      </p>\n      <p>\n        Phone: 030 81097-601<br>\n        Mail: service@team.mobile.de\n      </p>\n\n      <p>\n        Sollte grunds\xc3\xa4tzliches Interesse am Bezug von mobile.de Daten bestehen, wende Dich bitte an:<br/>\n        If you are primarily interested in purchasing data from mobile.de, please contact:\n      </p>\n      <p>\n        Mail: Datenpartner@team.mobile.de\n      </p>\n    </div>\n    <hr class="u-pad-top-9 u-pad-bottom-18"/>\n    <div id="footer"></div>\n    <script async src="https://www.mobile.de/api/consent/static/js/consentBanner.js"></script>\n  <script type="text/javascript" 
src="https://www.mobile.de/youre-blocked/app.js"></script><script type="text/javascript" >var _cf = _cf || []; _cf.push([\'_setFsp\', true]);  _cf.push([\'_setBm\', true]);  _cf.push([\'_setAu\', \'/static/16b9372bb8fti233b6fc758bf7a4291f0\']); </script><script type="text/javascript"  src="/static/16b9372bb8fti233b6fc758bf7a4291f0"></script></body>\n</html>\n'
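
A scraper can check for this block page before parsing, instead of silently printing no links. The following is a minimal sketch: the marker strings are taken from the response shown above, while the names BLOCK_MARKERS and is_block_page are hypothetical and not part of any library.

```python
# Marker strings copied from the block page shown above.
BLOCK_MARKERS = (
    b"automated access to this page was denied",
    b"Are you a human?",
)


def is_block_page(body: bytes) -> bool:
    """Return True if the response body looks like the mobile.de block page."""
    return any(marker in body for marker in BLOCK_MARKERS)


# Shortened stand-in for the blocked response body (illustration only).
sample = (
    b"<!DOCTYPE html>\n<html>\n  <!--\n"
    b"Unfortunately, automated access to this page was denied.\n  -->\n"
    b"<title>Ups, bist Du ein Mensch? / Are you a human?</title></html>"
)
print(is_block_page(sample))
print(is_block_page(b"<html><body>regular page</body></html>"))
```

In the scraper above, this would be called as is_block_page(r.content) right after session.get(url).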

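When I first created the scraper, I also got the message "Unfortunately, automated access to this page was denied." when using urllib, so I switched to requests-html and everything worked fine.

I have already tried the following to fix it, but so far none of it has worked :(

Proxy rotation (I thought my IP address might have been blocked)
Different user agents in the header via the fake_useragent library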
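
For reference, the two attempts listed above might look roughly like the sketch below. The question does not show the code that was actually used, so everything here is an assumption: the user agent strings and proxy addresses are made-up placeholders, next_request_kwargs is a hypothetical helper, and the user agents are hard-coded rather than generated with fake_useragent.

```python
from itertools import cycle

# Placeholder values for illustration only; not the asker's real settings.
USER_AGENTS = cycle([
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
])
PROXIES = cycle([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])


def next_request_kwargs() -> dict:
    """Build keyword arguments for session.get() with the next user agent and proxy."""
    proxy = next(PROXIES)
    return {
        "headers": {"User-Agent": next(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
    }


first = next_request_kwargs()
second = next_request_kwargs()
print(first["headers"]["User-Agent"])
print(second["headers"]["User-Agent"])
```

Each call would then be passed on as session.get(url, **next_request_kwargs()). Since the block page asks for cookies and JavaScript, changing only the IP address and the User-Agent header may not be enough to get past it.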

I hope you can help, because at this point I don't know what else I could try.

Thank you very much for helping me with this issue :)

Answer 1

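First use Selenium WebDriver to navigate to the search page, then run the query from there.

I just ran your code on my own machine and got the same message. When I visit the site manually, I also get a reCAPTCHA. Even opening the URL directly with Selenium produces a reCAPTCHA.

If I wanted to defeat a scraper like yours, I would require a reCAPTCHA whenever someone connects directly to the search results. That is my guess as to how you are being blocked. When I used WebDriver to navigate to the search page first, I was not challenged.

Here is the code I used.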

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# implicitly_wait takes seconds and applies to every later find_element call
driver.implicitly_wait(5)  # not good practice, but quick and easy
driver.get("https://suchen.mobile.de/fahrzeuge/search.html")
driver.find_element(By.ID, "gdpr-consent-accept-button").click()  # accept the consent banner
driver.find_element(By.ID, "fuels-PETROL-ds").click()  # tick the petrol fuel filter
driver.find_element(By.ID, "dsp-upper-search-btn").click()  # submit the search

This won't work forever, but it works at least for now.
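
Once the results page has loaded in the browser, the detail links from the question can be collected from the rendered HTML. The sketch below is one possible way to do that, not the answerer's code. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies. DetailLinkParser and extract_detail_links are hypothetical names, and the ?id=123 query string in the sample is invented.

```python
from html.parser import HTMLParser

# URL prefix of ad detail pages, as given in the question.
DETAIL_PREFIX = "https://suchen.mobile.de/fahrzeuge/details.html"


class DetailLinkParser(HTMLParser):
    """Collect href values of <a> tags that point to ad detail pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(DETAIL_PREFIX):
            self.links.append(href)


def extract_detail_links(html: str) -> list:
    """Return all detail-page links found in the given HTML string."""
    parser = DetailLinkParser()
    parser.feed(html)
    return parser.links


# Tiny stand-in for a results page (illustration only).
sample_html = (
    '<a href="https://suchen.mobile.de/fahrzeuge/details.html?id=123">Ad</a>'
    '<a href="https://www.mobile.de/">Home</a>'
)
print(extract_detail_links(sample_html))
```

With the driver from the code above, this would be called as extract_detail_links(driver.page_source) after the search button has been clicked.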

beautifulsoup

web-crawler

web-scraping

python-requests-html