Scrapy -- Scraping a page and scraping next pages
I am trying to scrape RateMyProfessors for professor statistics, as defined in my items.py file:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class ScraperItem(Item):
    # define the fields for your item here like:
    numOfPages = Field()       # number of pages of professors (usually 476)
    firstMiddleName = Field()  # first (and middle) name
    lastName = Field()         # last name
    numOfRatings = Field()     # number of ratings
    overallQuality = Field()   # numerical rating
    averageGrade = Field()     # letter grade
    profile = Field()          # url of professor profile
Here is my scraper_spider.py file:
import scrapy
from scraper.items import ScraperItem
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


class scraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="nextLink"]')),
             callback='parse', follow=True),
    )

    def parse(self, response):
        # professors = []
        numOfPages = int(response.xpath('((//a[@class="step"])[last()])/text()').extract()[0])
        # create array of profile links
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()
        # for each of those links
        for profile in profiles:
            # define item
            professor = ScraperItem()
            # add profile to professor
            professor["profile"] = profile
            # pass each page to the parse_profile() method
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            # add professor to array of professors
            yield request

    def parse_profile(self, response):
        professor = response.meta["professor"]
        if response.xpath('//*[@class="pfname"]'):
            # scrape each item from the link that was passed as an argument
            # and add it to the current professor
            professor["firstMiddleName"] = response.xpath('//h1[@class="profname"]/span[@class="pfname"][1]/text()').extract()
        if response.xpath('//*[@class="plname"]'):
            professor["lastName"] = response.xpath('//h1[@class="profname"]/span[@class="plname"]/text()').extract()
        if response.xpath('//*[@class="table-toggle rating-count active"]'):
            professor["numOfRatings"] = response.xpath('//div[@class="table-toggle rating-count active"]/text()').extract()
        if response.xpath('//*[@class="grade"]'):
            professor["overallQuality"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][1]/div[@class="grade"]/text()').extract()
        if response.xpath('//*[@class="grade"]'):
            professor["averageGrade"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][2]/div[@class="grade"]/text()').extract()
        return professor

# add string to rule. linkextractor only gets "/showratings.." not "ratemyprofessors.com/showratings"
My problem is in the scraper_spider.py file above. The spider is supposed to go to this RateMyProfessors page, visit each professor and collect their information, then return to the directory and get the next professor. Once there are no more professors to scrape on the page, it should find the href value of the Next button, go to that page, and repeat the same process.
My scraper is able to scrape all of the professors on page 1 of the directory, but it stops after that because it never goes to the next page.
Can you help my scraper successfully find and follow the next page?
I tried to follow this Stack Overflow question, but it was too specific to adapt.
If you want to use the rules attribute, your scraperSpider should inherit from CrawlSpider. See the documentation here. Also note this warning from the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
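In other words, the spider above never paginates because scrapy.Spider ignores rules, and even under CrawlSpider the rule's callback='parse' would break the crawler. A minimal sketch of the CrawlSpider variant, assuming the same Scrapy version as the question (the scrapy.contrib import paths); the parse_listing name is my own choice, and the XPaths are copied from the question:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from scraper.items import ScraperItem


class ScraperSpider(CrawlSpider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    # follow the "next" link on every directory page; the callback must
    # not be named 'parse', which CrawlSpider reserves for itself
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="nextLink"]'),
             callback='parse_listing', follow=True),
    )

    def parse_start_url(self, response):
        # rules only fire on pages the link extractor followed, so route
        # the first directory page through the same listing callback
        return self.parse_listing(response)

    def parse_listing(self, response):
        for profile in response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract():
            professor = ScraperItem()
            professor["profile"] = profile
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            yield request

    # parse_profile stays exactly as in the question

parse_start_url is the hook CrawlSpider calls for responses from start_urls; without it the professors on page 1 would never be scraped, because rule callbacks only run on pages reached through the rules.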
I solved my problem by ignoring the rules entirely and following the "Following links" section of this doc.
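For reference, that approach keeps scrapy.Spider and drops the rules entirely: at the end of parse, extract the next-page href yourself and yield a new request back into parse. A minimal sketch under the same assumptions (Python 2 era, hence urlparse; the nextLink XPath comes from the question's rule):

import urlparse

import scrapy
from scraper.items import ScraperItem


class ScraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    def parse(self, response):
        # queue a parse_profile request for every professor on this page
        for profile in response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract():
            professor = ScraperItem()
            professor["profile"] = profile
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            yield request

        # then follow the "next" link, if any, back into this same method
        next_page = response.xpath('//a[@class="nextLink"]/@href').extract()
        if next_page:
            yield scrapy.Request(urlparse.urljoin(response.url, next_page[0]),
                                 callback=self.parse)

    # parse_profile stays exactly as in the question

Because parse is a plain scrapy.Spider callback here, recursing into it is safe; the crawl simply ends on the last directory page, where no nextLink element exists.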