My Scrapy CrawlSpider stops after the initial start URL
My spider looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(CrawlSpider):
    name = "craig"
    # allowed_domains = ["support.t-mobile.com/community/phones-tablets-devices/"]
    # start_urls = ["https://support.t-mobile.com/community/phones-tablets-devices/apple/content?start=20&filterID=contentstatus%5Bpublished%5D~objecttype~objecttype%5Bthread%5D"]
    allowed_domains = ["reddit.com/rising/"]
    start_urls = ["https://www.reddit.com/rising/"]

    rules = [
        Rule(LinkExtractor(allow=()), follow=True),
        Rule(LinkExtractor(allow=()), callback='parse')
    ]

    def parse(self, response):
        hxs = Selector(response)
        item = CraigslistSampleItem()
        # item['link'] = hxs.xpath('//td[@class = "j-td-title"]/div/a/@href').extract()
        # item['title'] = hxs.xpath('//td[@class = "j-td-title"]/div/a/text()').extract()
        # item['content'] = hxs.xpath('//div[@class="jive-rendered-content"]/p/text()').extract()
        item['URL'] = response.request.url
        print item
As you can see, I have not specified any allowed or restricted paths, so the spider should crawl every link. Can anyone tell me why my spider stops after the initial page?
The console output looks like this:
2016-10-25 14:36:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: craigslist_sample)
2016-10-25 14:36:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-25 14:36:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'craigslist_sample.spiders', 'SPIDER_MODULES': ['craigslist_sample.spiders'], 'BOT_NAME': 'craigslist_sample'}
2016-10-25 14:36:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-25 14:36:38 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-25 14:36:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-10-25 14:36:39 [boto] ERROR: Unable to read instance data, giving up
2016-10-25 14:36:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-25 14:36:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-25 14:36:39 [scrapy] INFO: Enabled item pipelines:
2016-10-25 14:36:39 [scrapy] INFO: Spider opened
2016-10-25 14:36:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-25 14:36:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-25 14:36:40 [scrapy] DEBUG: Crawled (200) <GET https://www.reddit.com/rising/> (referer: None)
2016-10-25 14:36:40 [scrapy] INFO: Closing spider (finished)
2016-10-25 14:36:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 219,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 24786,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 25, 18, 36, 40, 242330),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 25, 18, 36, 39, 525046)}
2016-10-25 14:36:40 [scrapy] INFO: Spider closed (finished)
I found three problems in your code:
(1) allowed_domains. allowed_domains is used to filter the links the spider is allowed to follow, and each entry must be a plain domain name, not a URL path. Please change it to:
allowed_domains = ['reddit.com']
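For comparison, here is the offending line from your spider next to the corrected one. OffsiteMiddleware (enabled in your log) uses this list to decide which extracted links to keep, which is most likely why your crawl stops after the start URL:
# allowed_domains = ["reddit.com/rising/"]   # a URL path: no hostname matches it, so every extracted link is dropped as offsite
allowed_domains = ["reddit.com"]             # a bare domain: links on www.reddit.com and other subdomains are kept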
(2) The parse callback. parse is the default callback for responses whose request does not specify a callback, and CrawlSpider relies on it internally. Rename the callback in your rule, e.g. to parse_item, and rename the parse method to parse_item as well. Please read the warning in the CrawlSpider documentation:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
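For example, the rule and callback rename (with no other changes) would look roughly like this; parse_item is just a conventional name, any name other than parse works:
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# Before: the rule's callback overrides parse(), which CrawlSpider itself needs to drive the rules.
#   Rule(LinkExtractor(allow=()), callback='parse')
# After: the callback gets its own name and CrawlSpider's parse() is left untouched.
Rule(LinkExtractor(allow=()), callback='parse_item')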
(3) The rules. Your two rules conflict with each other. Please read the Rule documentation:
rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
follow
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False.
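Concretely, those defaults mean the following pairs of rules behave identically (a small illustrative sketch, not taken from the docs):
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# With a callback, follow defaults to False, so these two rules are equivalent:
Rule(LinkExtractor(allow=()), callback='parse_item')
Rule(LinkExtractor(allow=()), callback='parse_item', follow=False)

# Without a callback, follow defaults to True, so these two are equivalent as well:
Rule(LinkExtractor(allow=()))
Rule(LinkExtractor(allow=()), follow=True)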
In your code:
rules = [
    # allow all links (in allowed_domains), follow them, but do not parse them
    Rule(LinkExtractor(allow=()), follow=True),
    # allow all links (in allowed_domains), do not follow them, and call parse on them
    Rule(LinkExtractor(allow=()), callback='parse')
]
Clearly, the two rules conflict: every link matches the first rule, so the second one is never applied. If you want to follow all links and parse them, use a single rule:
rules = [
    Rule(LinkExtractor(allow=()), follow=True, callback='parse_item')
]
Here is sample code that works for me:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = ["https://www.reddit.com/rising/"]

    rules = [
        Rule(LinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        pass
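If you also want to keep the item logic from the question instead of an empty callback, a minimal sketch would be the following (it assumes the same CraigslistSampleItem with a URL field as in the original project; the spider name and class name here are just placeholders):
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_sample.items import CraigslistSampleItem

class RedditItemSpider(CrawlSpider):
    name = "reddit_items"
    allowed_domains = ["reddit.com"]                  # fix (1): a domain, not a URL path
    start_urls = ["https://www.reddit.com/rising/"]

    rules = [
        # fix (3): one rule that both follows links and parses them
        Rule(LinkExtractor(allow=()), follow=True, callback='parse_item'),
    ]

    def parse_item(self, response):                   # fix (2): not named 'parse'
        item = CraigslistSampleItem()
        item['URL'] = response.request.url
        yield item                                    # yield instead of print so the item reaches pipelines and exports
Yielding the item instead of printing it lets Scrapy's item pipelines and -o feed exports pick it up.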
I hope my explanation helps. Thanks!