SgmlLinkExtractor not displaying results or following link

Question

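I can't fully understand how the SGML link extractor works. When building a crawler with Scrapy, I can successfully extract data from links when I give it specific URLs. The problem is using rules to follow the next-page link on a given URL.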

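I think the problem lies in the allow() attribute. When the rule is added to the code, the results are not displayed on the command line and the next-page link is not followed.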

Any help is greatly appreciated.

Here's the code...

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule

from tutorial.items import TutorialItem

class AllGigsSpider(CrawlSpider):
    name = "allGigs"
    allowed_domains = ["http://www.allgigs.co.uk/"]
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/comedy-1.html",
        "http://www.allgigs.co.uk/whats_on/London/theatre_and_opera-1.html",
        "http://www.allgigs.co.uk/whats_on/London/dance_and_ballet-1.html"
    ]    
    rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[@class="more"]',)), callback="parse_me", follow= True),
    )

    def parse_me(self, response):
        hxs = HtmlXPathSelector(response)
        infos = hxs.xpath('//div[@class="entry vevent"]')
        items = []
        for info in infos:
            item = TutorialItem()
            item ['artist'] = hxs.xpath('//span[@class="summary"]//text()').extract()
            item ['date'] = hxs.xpath('//abbr[@class="dtstart dtend"]//text()').extract()
            item ['endDate'] = hxs.xpath('//abbr[@class="dtend"]//text()').extract()            
            item ['startDate'] = hxs.xpath('//abbr[@class="dtstart"]//text()').extract()
            items.append(item)
        return items
        print items

Answer 1

The problem is in restrict_xpaths: it should point to the block in which the link extractor should look for links. Don't specify allow at all:

rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), 
         callback="parse_me", 
         follow=True),
]
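To make the idea of "the block in which to look for links" concrete, here is a rough standard-library sketch. It is not how Scrapy implements restrict_xpaths, and the markup and hrefs below are made up for illustration; only the "more" class name comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one navigation link and one "more" (next page) link.
PAGE = """
<html><body>
  <div class="nav"><a href="/about.html">About</a></div>
  <div class="more"><a href="/whats_on/London/clubbing-2.html">More</a></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Unrestricted: every link in the document is a candidate.
all_links = [a.get("href") for a in root.findall(".//a")]

# Restricted: only links inside <div class="more"> are candidates.
restricted_links = [
    a.get("href")
    for block in root.findall(".//div[@class='more']")
    for a in block.findall(".//a")
]

print(all_links)
print(restricted_links)
```

With the restriction in place, only the pagination link is left to follow, which is what the rule above is after.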

You also need to fix your allowed_domains, which should contain bare domain names rather than full URLs:

allowed_domains = ["www.allgigs.co.uk"]
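As a simplified approximation of why the URL form fails (this is not Scrapy's actual offsite filter, and is_allowed and the page-2 URL are hypothetical), consider a check that compares only the host part of each request against the allowed entries:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed entry."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical next-page URL on the same site as the start URLs.
next_page = "http://www.allgigs.co.uk/whats_on/London/clubbing-2.html"

# A full URL never equals a host name, so the link is treated as offsite.
print(is_allowed(next_page, ["http://www.allgigs.co.uk/"]))  # False

# A bare domain matches the host, so the link is kept.
print(is_allowed(next_page, ["www.allgigs.co.uk"]))  # True
```

Under this assumption, every followed link would be dropped as offsite while the start URLs are still fetched, which matches the symptom of nothing being followed.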

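Also note that print items in the parse_me() callback is unreachable, because it comes after the return statement. In addition, inside the loop you should not apply the XPath expressions to hxs, which searches the whole page on every iteration; the expressions should be evaluated relative to info. You can simplify parse_me():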

def parse_me(self, response):
    for info in response.xpath('//div[@class="entry vevent"]'):
        item = TutorialItem()
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()
        item['date'] = info.xpath('.//abbr[@class="dtstart dtend"]//text()').extract()
        item['endDate'] = info.xpath('.//abbr[@class="dtend"]//text()').extract()            
        item['startDate'] = info.xpath('.//abbr[@class="dtstart"]//text()').extract()
        yield item
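The difference between document-wide and per-entry queries can be sketched with the standard library alone. This uses xml.etree.ElementTree instead of Scrapy selectors, and the markup and artist names are invented; only the class names are taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with two entries.
PAGE = """
<html><body>
  <div class="entry vevent"><span class="summary">Artist A</span></div>
  <div class="entry vevent"><span class="summary">Artist B</span></div>
</body></html>
"""

root = ET.fromstring(PAGE)
entries = root.findall(".//div[@class='entry vevent']")

# Querying the whole document inside the loop: every entry gets every artist.
document_wide = [
    [span.text for span in root.findall(".//span[@class='summary']")]
    for _ in entries
]

# Querying relative to each entry: every entry gets only its own artist.
per_entry = [
    [span.text for span in entry.findall(".//span[@class='summary']")]
    for entry in entries
]

print(document_wide)  # [['Artist A', 'Artist B'], ['Artist A', 'Artist B']]
print(per_entry)      # [['Artist A'], ['Artist B']]
```

The first result mirrors what the original hxs.xpath('//...') calls produce, and the second mirrors the info.xpath('.//...') calls in the simplified callback.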


python

sgml

web-crawler

scrapy

scrapy-spider