restrict_xpaths parameter not filtering crawled data

I'm using Scrapy 1.0.5 and trying to scrape a series of articles to get their titles and corresponding URLs. I only want to scrape links inside the div element whose id is devBody. With that in mind, I tried to specify that restriction in the rule, but I can't figure out why it is still scraping links outside of that scope:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]

    rules = (Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]',), callback='parse'),)

    def parse(self, response):
        entries = response.xpath('//h4')
        items = []
        # using a counter here feels lame but I really couldn't think of a better
        # way to avoid getting a list of all URLs and titles wrapped into a single object
        i = 0
        for entry in entries:
            item = StackItem()
            item['title'] = entry.xpath('//a/text()').extract()[i]
            item['url'] = entry.xpath('//a/@href').extract()[i]
            yield item
            items.append(item)
            i += 1

To understand this behavior, I queried the elements with XPath in Chrome Dev Tools, and for a given article everything works as it should. However, when I (try to) put the same sequence of steps into my code, things don't go the same way. It is fetching data from outside the div, which ends up misplacing the URLs. It does say it fetched the 57 desired results, but then things go wrong.
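For reference, the extractor can also be checked on its own from scrapy shell (the following is only a sketch, not my actual session; the shell supplies the response object for the start URL):

# sketch: inspect what the LinkExtractor matches, run inside
# `scrapy shell http://dev.mysql.com/tech-resources/articles/`
from scrapy.linkextractors import LinkExtractor

extractor = LinkExtractor(restrict_xpaths='//div[@id="devBody"]')
links = extractor.extract_links(response)  # Link objects with .url and .text

print(len(links))        # number of links found inside the devBody div
for link in links[:5]:
    print(link.url, link.text)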

I have no idea what I'm doing wrong. Any help would be greatly appreciated.

You need to base your StackSpider class on the CrawlSpider class, which is the one that provides the rules attribute; see the docs here. You will also need to rename your parse() method and change the callback in the rule accordingly, because CrawlSpider uses parse() internally, as explained in the docs.
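As a rough sketch of that change (the callback name parse_item and the title XPath used on the followed article pages are placeholders, not something taken from the question):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from stack.items import StackItem

class StackSpider(CrawlSpider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]

    # the rule only follows links found inside the devBody div and sends each
    # followed page to parse_item, leaving CrawlSpider's own parse() untouched
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # response is one of the followed article pages
        item = StackItem()
        item['title'] = response.xpath('//h1/text()').extract_first()  # placeholder XPath
        item['url'] = response.url
        yield item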

Or, plan B:

CrawlSpider doesn't help much for scraping this page. Using a plain Spider and looping over the 'h4/a' combinations to pull out the information you want is simple enough. Try this:

for row in response.xpath('//div[@id="devBody"]/h4'):
    item = StackItem()  # a fresh item for each article entry
    item['title'] = row.xpath('a/text()').extract()
    # get the full url
    item['url'] = response.urljoin(row.xpath('a/@href').extract_first())
    yield item
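Dropped into a complete spider, that could look roughly like this (a sketch reusing the item fields and start URL from the question; extract_first() is used for the title as well so it comes out as a single string rather than a one-element list):

from scrapy import Spider
from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]

    def parse(self, response):
        # each article on the listing page is an h4 wrapping a single link
        for row in response.xpath('//div[@id="devBody"]/h4'):
            item = StackItem()
            item['title'] = row.xpath('a/text()').extract_first()
            # urljoin turns the relative href into an absolute URL
            item['url'] = response.urljoin(row.xpath('a/@href').extract_first())
            yield item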