Python web scraping the Bloomberg site for company addresses - getting an 'Are you a robot' captcha while fetching the HTML content from a URL

Question

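My Python file is in a Scrapy project, and my settings.py (edited in Spyder) has robotstxt_obey = False. I have successfully installed and imported pandas, scrapy, spyder, beautifulsoup, and requests.

However, when the code below runs, I get an "Are you a robot?" captcha page instead of the HTML I am trying to fetch. I have read many posts answering similar questions, but I still cannot resolve the error. I cannot post the whole code, so I am posting the part that has the problem. I hope my question is clear. Please help, and thanks in advance.

Code: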

if pd.notnull(row['Company']) or pd.notnull(row['Domain']):
    #OR (pd.isnull(row['Company']) == False AND pd.isnull(row['Company']) == False)
    # pd.isnull(row['City']) == True and and pd.isnull(row['Address']) == True
    listUrl = []
    print(row['Domain'])
    # Search by company name when available, otherwise fall back to the domain
    if pd.notnull(row['Company']):
        listUrl = get_urls(row['Company'] + ' bloomberg', 10, 'en')
    else:
        listUrl = get_urls(row['Domain'] + ' bloomberg', 10, 'en')
    for item1 in listUrl:
        print("in bloomberg 1")
        print(item1)
        # Only follow links that point at a Bloomberg company profile page
        if 'www.bloomberg.com/profile/company/' in item1:
            print("in bloomberg 2")
            res = requests.get(item1, headers=headers)
            print(res.content)  # this is where the captcha page shows up
            soup2 = bs(res.content, 'html.parser')
            items = soup2.find_all("div", {"class": "infoTableItemValue__e188b0cb"})
            print(items)

Answer 1

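I ran into the same problem, but I was able to solve it by adding "user-agent": "Mozilla/80.0" to my request headers. I would also recommend adding some error handling so that the code does not blow up when a connection cannot be established because you have been blocked.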

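import requests

for url in start_links:
    try:
        response = requests.get(url[0], timeout=5, cookies=cookies, headers={"user-agent": "Mozilla/80.0"})
        print(response, url)

    except requests.exceptions.RequestException:
        # requests wraps urllib3's NewConnectionError in its own exception
        # classes, so catch the requests hierarchy and skip this URL
        continue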
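
A blocked request can still come back without raising an exception, so it may also help to check the body before parsing it. The following is only a minimal sketch: the helper names `looks_like_captcha` and `fetch_profile_html` are hypothetical, and the marker string assumes the block page contains the "Are you a robot" phrase quoted in the question.

```python
from typing import Optional

# Assumption: the bot-check page contains this phrase, as quoted in the question.
CAPTCHA_MARKER = "are you a robot"


def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML looks like the bot-check page (simple heuristic)."""
    return CAPTCHA_MARKER in html.lower()


def fetch_profile_html(session, url: str, timeout: float = 5) -> Optional[str]:
    """Fetch url through a requests-style session; return None when blocked.

    `session` is any object with a requests-like get() method, for example a
    requests.Session() whose headers include a browser-like user-agent.
    """
    response = session.get(url, timeout=timeout)
    # Treat a non-200 status or a captcha body as "blocked" and let the caller skip it.
    if response.status_code != 200 or looks_like_captcha(response.text):
        return None
    return response.text
```

With requests, this could be used by creating session = requests.Session(), calling session.headers.update({"user-agent": "Mozilla/80.0"}), and then skipping any URL for which fetch_profile_html(session, url) returns None. Passing the session in keeps the check easy to try out with a stand-in object and no network access.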

python

captcha

beautifulsoup

scrapy

web-scraping