403 HTTP status code is not handled or not allowed

I am trying to get a list of locations from https://www.taylorwimpey.co.uk/sitemap. It opens fine in my browser, but when I try it with Scrapy I get nothing back and:

2022-04-30 11:49:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-30 11:49:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.taylorwimpey.co.uk/sitemap> (referer: None)
2022-04-30 11:49:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.taylorwimpey.co.uk/sitemap>: HTTP status code is not handled or not allowed
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Closing spider (finished)
Starting csv blank line cleaning
2022-04-30 11:49:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2020,
 'downloader/response_count': 1,
 'downloader/response_status_count/403': 1,
 'elapsed_time_seconds': 2.297067,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 30, 10, 49, 22, 111984),
 'httpcompression/response_bytes': 3932,
 'httpcompression/response_count': 1,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/403': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 11,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 30, 10, 49, 19, 814917)}
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Spider closed (finished)
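
As far as I can tell, the "HTTP status code is not handled or not allowed" part just means Scrapy's HttpError middleware filters the 403 out before it reaches my callback. A minimal sketch like the one below would at least let the 403 response reach parse for inspection, though it does not get around whatever is blocking the request:

import scrapy

class Debug403Spider(scrapy.Spider):
    # hypothetical spider, only to show how to surface the 403 for debugging
    name = "debug_403"
    start_urls = ["https://www.taylorwimpey.co.uk/sitemap"]
    handle_httpstatus_list = [403]  # let 403 responses through to the callback

    def parse(self, response):
        # inspect what the server actually sent back with the 403
        self.logger.info("Got %s, body length %s", response.status, len(response.body))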

I have already tried making adjustments in settings.py, such as changing the user agent, but so far nothing has worked.
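
For reference, the settings.py change was roughly along these lines (the user-agent string below is just an example, not necessarily the exact one I used):

# settings.py -- roughly the kind of adjustment tried (example user-agent string)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"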

My code is:

import scrapy

from TaylorWimpey.items import TaylorwimpeyItem

from scrapy.http import TextResponse
from selenium import webdriver

class taylorwimpeySpider(scrapy.Spider):
 
    name = "taylorwimpey"
    allowed_domains = ["taylorwimpey.co.uk"]

    start_urls = ["https://www.taylorwimpey.co.uk/sitemap"]

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")       
    

    def parse(self, response): # build a list of all locations
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        
        url_list1 = []
        
        for href in response1.xpath('//div[@class="content-container"]/ul/li/a/@href'):
            url = response1.urljoin(href.extract())
            url_list1.append(url)
            print(url)

Any ideas on what to do?

You are getting the 403 because the website is protected by Cloudflare.

https://www.taylorwimpey.co.uk/sitemap could be using a CNAME configuration

https://www.taylorwimpey.co.uk/sitemap is using Cloudflare CDN/Proxy!
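
A quick way to confirm this from a script is to look at the extra headers Cloudflare attaches to every proxied response, for example:

import requests  # assuming the requests library is available

resp = requests.get('https://www.taylorwimpey.co.uk/sitemap')
print(resp.status_code)              # typically 403 when Cloudflare blocks the client
print(resp.headers.get('Server'))    # 'cloudflare' when the site sits behind the Cloudflare proxy
print(resp.headers.get('CF-RAY'))    # CF-RAY is another header Cloudflare adds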

Scrapy with Selenium cannot get past it, but Selenium on its own can handle this case and get through the protection without trouble.

import time
import pandas as pd
# selenium 4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Chrome options to pass as arguments
option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

# keep Chrome open after the script finishes
option.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)
driver.get('https://www.taylorwimpey.co.uk/sitemap')
time.sleep(2)

# collect the href of every location link on the sitemap page
urls = []
for link in driver.find_elements(By.XPATH, '//*[@class="content-container"]/ul/li/a'):
    urls.append(link.get_attribute('href'))

df = pd.DataFrame(urls, columns=['Links'])
print(df)

Output:

                              Links
0     https://www.taylorwimpey.co.uk/new-homes/abera...
1     https://www.taylorwimpey.co.uk/new-homes/aberarth
2     https://www.taylorwimpey.co.uk/new-homes/aberavon
3     https://www.taylorwimpey.co.uk/new-homes/aberdare
4     https://www.taylorwimpey.co.uk/new-homes/aberdeen
...                                                 ...
1691   https://www.taylorwimpey.co.uk/new-homes/yateley
1692  https://www.taylorwimpey.co.uk/new-homes/yealm...
1693    https://www.taylorwimpey.co.uk/new-homes/yeovil
1694      https://www.taylorwimpey.co.uk/new-homes/york
1695  https://www.taylorwimpey.co.uk/new-homes/ystra...

[1696 rows x 1 columns]
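
If the links then need to go into the original CSV workflow, the DataFrame can simply be written out (the file name below is just an example):

# save the collected links to CSV (file name is an example)
df.to_csv('taylorwimpey_locations.csv', index=False)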
