403 HTTP status code is not handled or not allowed
I am trying to get a list of locations from https://www.taylorwimpey.co.uk/sitemap. It opens fine in my browser, but when I try it with Scrapy I get nothing back and:
2022-04-30 11:49:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-30 11:49:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.taylorwimpey.co.uk/sitemap> (referer: None)
2022-04-30 11:49:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.taylorwimpey.co.uk/sitemap>: HTTP status code is not handled or not allowed
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Closing spider (finished)
Starting csv blank line cleaning
2022-04-30 11:49:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2020,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 2.297067,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 30, 10, 49, 22, 111984),
'httpcompression/response_bytes': 3932,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 30, 10, 49, 19, 814917)}
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Spider closed (finished)
I have already tried adjustments in settings.py, such as changing the user agent, but so far nothing has worked.
My code is:
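For reference, the kind of settings.py change tried here is a user-agent override along these lines (the value is just an example browser string, not the exact one used):

```python
# settings.py -- override Scrapy's default "Scrapy/x.y" user agent
# with a browser-like string (example value, any recent browser UA works)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/100.0.4896.127 Safari/537.36"
)
```

On its own this is often not enough against bot protection, which is consistent with the 403 seen in the log above.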
import scrapy
from TaylorWimpey.items import TaylorwimpeyItem
from scrapy.http import TextResponse
from selenium import webdriver


class taylorwimpeySpider(scrapy.Spider):
    name = "taylorwimpey"
    allowed_domains = ["taylorwimpey.co.uk"]
    start_urls = ["https://www.taylorwimpey.co.uk/sitemap"]

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")

    def parse(self, response):  # build a list of all locations
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        url_list1 = []
        for href in response1.xpath('//div[@class="content-container"]/ul/li/a/@href'):
            url = response1.urljoin(href.extract())
            url_list1.append(url)
            print(url)
Any thoughts on what to do?
You are getting a 403 because the website is protected by Cloudflare.
https://www.taylorwimpey.co.uk/sitemap could be using a CNAME configuration
https://www.taylorwimpey.co.uk/sitemap is using Cloudflare CDN/Proxy!
Scrapy combined with Selenium cannot handle it here, but Selenium on its own can load the page and get past the protection without trouble.
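You can sanity-check the Cloudflare claim yourself from the response headers: Cloudflare-proxied sites normally answer (even with a 403) carrying a "server: cloudflare" header and a "cf-ray" request id. A small helper sketching that heuristic (is_cloudflare is a hypothetical name, not part of any library):

```python
def is_cloudflare(headers):
    """Heuristic check: Cloudflare-proxied responses usually include a
    'server: cloudflare' header and/or a 'cf-ray' request id."""
    # header names are case-insensitive, so normalize the keys first
    h = {k.lower(): v for k, v in headers.items()}
    return h.get("server", "").lower() == "cloudflare" or "cf-ray" in h


# Example with headers shaped like a blocked 403 response (values illustrative):
print(is_cloudflare({"Server": "cloudflare", "CF-RAY": "6f1a2b3c4d5e6f01-LHR"}))
print(is_cloudflare({"Server": "nginx"}))
```

This only detects the proxy; it says nothing about whether a given client will be challenged.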
import time
import pandas as pd
# selenium 4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# options to add as arguments
from selenium.webdriver.chrome.options import Options

option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
# chrome to stay open
option.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)
driver.get('https://www.taylorwimpey.co.uk/sitemap')
time.sleep(2)

URL = []
for url in driver.find_elements(By.XPATH, '//*[@class="content-container"]/ul/li/a'):
    url = url.get_attribute('href')
    URL.append(url)
    # print(url)

df = pd.DataFrame(URL, columns=['Links'])
print(df)
Output:
Links
0 https://www.taylorwimpey.co.uk/new-homes/abera...
1 https://www.taylorwimpey.co.uk/new-homes/aberarth
2 https://www.taylorwimpey.co.uk/new-homes/aberavon
3 https://www.taylorwimpey.co.uk/new-homes/aberdare
4 https://www.taylorwimpey.co.uk/new-homes/aberdeen
... ...
1691 https://www.taylorwimpey.co.uk/new-homes/yateley
1692 https://www.taylorwimpey.co.uk/new-homes/yealm...
1693 https://www.taylorwimpey.co.uk/new-homes/yeovil
1694 https://www.taylorwimpey.co.uk/new-homes/york
1695 https://www.taylorwimpey.co.uk/new-homes/ystra...
[1696 rows x 1 columns]
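Since the question's log mentions CSV cleaning, the collected links can then be written straight to disk with pandas. A minimal sketch (the filename is just an example; the single-element URL list stands in for the 1696 links scraped above):

```python
import pandas as pd

# stand-in for the list collected by the Selenium loop above
URL = ["https://www.taylorwimpey.co.uk/new-homes/york"]
df = pd.DataFrame(URL, columns=["Links"])

# index=False drops pandas' row-number column, a common source of stray
# or "blank" columns when the CSV is re-read later
df.to_csv("taylorwimpey_locations.csv", index=False)
```

Writing without the index avoids the need for a separate blank-line/column cleanup pass afterwards.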