My Scrapy CrawlSpider stops after the initial start URL
My spider looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(CrawlSpider):
    name = "craig"
    # allowed_domains = ["support.t-mobile.com/community/phones-tablets-devices/"]
    # start_urls = ["https://support.t-mobile.com/community/phones-tablets-devices/apple/content?start=20&filterID=contentstatus%5Bpublished%5D~objecttype~objecttype%5Bthread%5D"]
    allowed_domains = ["reddit.com/rising/"]
    start_urls = ["https://www.reddit.com/rising/"]

    rules = [
        Rule(LinkExtractor(allow=()), follow=True),
        Rule(LinkExtractor(allow=()), callback='parse')
    ]

    def parse(self, response):
        hxs = Selector(response)
        item = CraigslistSampleItem()
        # item['link'] = hxs.xpath('//td[@class = "j-td-title"]/div/a/@href').extract()
        # item['title'] = hxs.xpath('//td[@class = "j-td-title"]/div/a/text()').extract()
        # item['content'] = hxs.xpath('//div[@class="jive-rendered-content"]/p/text()').extract()
        item['URL'] = response.request.url
        print item
As you can see, I have not specified any allowed or restricted paths, so the spider should crawl every link. Can anyone tell me why my spider stops after the initial page?
The console output looks like this:
2016-10-25 14:36:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: craigslist_sample)
2016-10-25 14:36:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-25 14:36:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'craigslist_sample.spiders', 'SPIDER_MODULES': ['craigslist_sample.spiders'], 'BOT_NAME': 'craigslist_sample'}
2016-10-25 14:36:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-25 14:36:38 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-25 14:36:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-10-25 14:36:39 [boto] ERROR: Unable to read instance data, giving up
2016-10-25 14:36:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-25 14:36:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-25 14:36:39 [scrapy] INFO: Enabled item pipelines:
2016-10-25 14:36:39 [scrapy] INFO: Spider opened
2016-10-25 14:36:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-25 14:36:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-25 14:36:40 [scrapy] DEBUG: Crawled (200) <GET https://www.reddit.com/rising/> (referer: None)
2016-10-25 14:36:40 [scrapy] INFO: Closing spider (finished)
2016-10-25 14:36:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 24786,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 25, 18, 36, 40, 242330),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 25, 18, 36, 39, 525046)}
2016-10-25 14:36:40 [scrapy] INFO: Spider closed (finished)
I found three problems in your code:
(1) allowed_domains. allowed_domains is used to filter the links the spider is allowed to follow, and each entry must be a plain domain name, not a URL path. Please change it to:
allowed_domains = ['reddit.com']
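For comparison, here is the offending line from your spider next to the corrected one. OffsiteMiddleware (enabled in your log) uses this list to decide which extracted links to keep, which is most likely why your crawl stops after the start URL:
# allowed_domains = ["reddit.com/rising/"]   # a URL path: no hostname matches it, so every extracted link is dropped as offsite
allowed_domains = ["reddit.com"]             # a bare domain: links on www.reddit.com and other subdomains are kept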
(2) The parse callback. parse is the default callback for responses whose request does not specify a callback, and CrawlSpider relies on it internally. Rename the callback in your rule, e.g. to parse_item, and rename the parse method to parse_item as well. Please read the warning in the CrawlSpider documentation:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
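For example, the rule and callback rename (with no other changes) would look roughly like this; parse_item is just a conventional name, any name other than parse works:
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# Before: the rule's callback overrides parse(), which CrawlSpider itself needs to drive the rules.
#   Rule(LinkExtractor(allow=()), callback='parse')
# After: the callback gets its own name and CrawlSpider's parse() is left untouched.
Rule(LinkExtractor(allow=()), callback='parse_item')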
(3) The rules. Your two rules conflict with each other. Please read the Rule documentation:
rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
follow
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False.
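Concretely, those defaults mean the following pairs of rules behave identically (a small illustrative sketch, not taken from the docs):
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# With a callback, follow defaults to False, so these two rules are equivalent:
Rule(LinkExtractor(allow=()), callback='parse_item')
Rule(LinkExtractor(allow=()), callback='parse_item', follow=False)

# Without a callback, follow defaults to True, so these two are equivalent as well:
Rule(LinkExtractor(allow=()))
Rule(LinkExtractor(allow=()), follow=True)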
In your code:
rules = [
    # allow all links (in allowed_domains), follow them, but do not parse them
    Rule(LinkExtractor(allow=()), follow=True),
    # allow all links (in allowed_domains), do not follow them, and call parse on them
    Rule(LinkExtractor(allow=()), callback='parse')
]
Clearly, the two rules conflict: every link matches the first rule, so the second one is never applied. If you want to follow all links and parse them, use a single rule:
rules = [
    Rule(LinkExtractor(allow=()), follow=True, callback='parse_item')
]
Here is sample code that works for me:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = ["https://www.reddit.com/rising/"]

    rules = [
        Rule(LinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        pass
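If you also want to keep the item logic from the question instead of an empty callback, a minimal sketch would be the following (it assumes the same CraigslistSampleItem with a URL field as in the original project; the spider name and class name here are just placeholders):
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_sample.items import CraigslistSampleItem

class RedditItemSpider(CrawlSpider):
    name = "reddit_items"
    allowed_domains = ["reddit.com"]                  # fix (1): a domain, not a URL path
    start_urls = ["https://www.reddit.com/rising/"]

    rules = [
        # fix (3): one rule that both follows links and parses them
        Rule(LinkExtractor(allow=()), follow=True, callback='parse_item'),
    ]

    def parse_item(self, response):                   # fix (2): not named 'parse'
        item = CraigslistSampleItem()
        item['URL'] = response.request.url
        yield item                                    # yield instead of print so the item reaches pipelines and exports
Yielding the item instead of printing it lets Scrapy's item pipelines and -o feed exports pick it up.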
I hope my explanation helps. Thanks!