Scrapy returning 403 error (Forbidden)
I am new to both Scrapy and Python. In the past I managed to get a minimal Scrapy example working, but I haven't used it since. In the meantime, a new version has been released (I think the one I used last was 0.24), and I cannot figure out why I get a 403 error on every site I try to scrape.
Admittedly, I haven't dug into middlewares and/or pipelines yet, but I was hoping to get a minimal example running before exploring further. That said, here is my current code:
items.py
import scrapy


class StackItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
stack_spider.py
# derived from https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem


class StackSpider(Spider):
    handle_httpstatus_list = [403, 404]  # kind of out of desperation. Is it serving any purpose?
    name = "stack"
    allowed_domains = ["whosebug.com"]
    start_urls = [
        "http://whosebug.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            self.log(question)
            item = StackItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
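Since the 403 is returned before any page is parsed, the XPath logic in parse() can also be sanity-checked offline. A minimal sketch, using the standard library's xml.etree in place of Scrapy's Selector, against a hand-made HTML snippet whose class names mirror the spider's XPaths (not real site markup):

```python
# Offline sanity check of the spider's XPath logic, using the standard
# library's xml.etree instead of Scrapy's Selector. The snippet below is
# a hand-made stand-in for the question-list markup.
import xml.etree.ElementTree as ET

html = (
    "<body><div class='summary'><h3>"
    "<a class='question-hyperlink' href='/questions/1'>Example question</a>"
    "</h3></div></body>"
)

root = ET.fromstring(html)
for h3 in root.findall(".//div[@class='summary']/h3"):
    a = h3.find("a[@class='question-hyperlink']")
    print(a.text, a.get("href"))  # Example question /questions/1
```

If the selectors match here but the spider still gets nothing, the problem is the request being blocked, not the parsing.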
Output
(pyplayground) 22:39 ~/stack $ scrapy crawl stack
2016-03-07 22:39:38 [scrapy] INFO: Scrapy 1.0.5 started (bot: stack)
2016-03-07 22:39:38 [scrapy] INFO: Optional features available: ssl, http11
2016-03-07 22:39:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'stack', 'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 403, 404, 408], 'DOWNLOAD_DELAY': 3}
2016-03-07 22:39:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-07 22:39:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-07 22:39:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-07 22:39:39 [scrapy] INFO: Enabled item pipelines:
2016-03-07 22:39:39 [scrapy] INFO: Spider opened
2016-03-07 22:39:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-07 22:39:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-07 22:39:39 [scrapy] DEBUG: Retrying <GET http://whosebug.com/questions?pagesize=50&sort=newest> (failed 1 times): 403 Forbidden
2016-03-07 22:39:42 [scrapy] DEBUG: Retrying <GET http://whosebug.com/questions?pagesize=50&sort=newest> (failed 2 times): 403 Forbidden
2016-03-07 22:39:47 [scrapy] DEBUG: Retrying <GET http://whosebug.com/questions?pagesize=50&sort=newest> (failed 3 times): 403 Forbidden
2016-03-07 22:39:51 [scrapy] DEBUG: Retrying <GET http://whosebug.com/questions?pagesize=50&sort=newest> (failed 4 times): 403 Forbidden
2016-03-07 22:39:55 [scrapy] DEBUG: Retrying <GET http://whosebug.com/questions?pagesize=50&sort=newest> (failed 5 times): 403 Forbidden
2016-03-07 22:39:58 [scrapy] DEBUG: Gave up retrying <GET http://whosebug.com/questions?pagesize=50&sort=newest> (failed 6 times): 403 Forbidden
2016-03-07 22:39:58 [scrapy] DEBUG: Crawled (403) <GET http://whosebug.com/questions?pagesize=50&sort=newest> (referer: None)
2016-03-07 22:39:58 [scrapy] INFO: Closing spider (finished)
2016-03-07 22:39:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1488,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 6624,
'downloader/response_count': 6,
'downloader/response_status_count/403': 6,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 7, 22, 39, 58, 458578),
'log_count/DEBUG': 8,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2016, 3, 7, 22, 39, 39, 607472)}
2016-03-07 22:39:58 [scrapy] INFO: Spider closed (finished)
Clearly you are behind a proxy. Check and set your http_proxy and https_proxy environment variables appropriately. Cross-check whether curl can fetch the URL from the terminal.
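A quick way to see which proxy settings the process would actually pick up is to inspect the environment from Python. A minimal sketch, assuming the conventional variable names (these are the ones Scrapy's HttpProxyMiddleware, curl, and urllib all consult):

```python
# Print the proxy-related environment variables that Scrapy's
# HttpProxyMiddleware (and curl, and urllib) consult.
import os
import urllib.request

for var in ("http_proxy", "https_proxy", "no_proxy"):
    print(var, "=", os.environ.get(var, "<not set>"))

# urllib resolves the same variables; an empty dict means no proxy
# would be injected into outgoing requests.
print(urllib.request.getproxies())
```

If the dict is empty but your network requires a proxy, export the variables in the shell before running `scrapy crawl`.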