Scrapy -- Scraping a page and scraping next pages
I am trying to scrape RateMyProfessors for professor statistics, as defined in my items.py file:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class ScraperItem(Item):
    # define the fields for your item here like:
    numOfPages = Field()       # number of pages of professors (usually 476)
    firstMiddleName = Field()  # first (and middle) name
    lastName = Field()         # last name
    numOfRatings = Field()     # number of ratings
    overallQuality = Field()   # numerical rating
    averageGrade = Field()     # letter grade
    profile = Field()          # url of professor profile
Here is my scraper_spider.py file:
import scrapy
from scraper.items import ScraperItem
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


class scraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="nextLink"]')),
             callback='parse', follow=True),
    )

    def parse(self, response):
        # professors = []
        numOfPages = int(response.xpath('((//a[@class="step"])[last()])/text()').extract()[0])
        # create array of profile links
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()
        # for each of those links
        for profile in profiles:
            # define item
            professor = ScraperItem()
            # add profile to professor
            professor["profile"] = profile
            # pass each page to the parse_profile() method
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            # add professor to array of professors
            yield request

    def parse_profile(self, response):
        professor = response.meta["professor"]
        if response.xpath('//*[@class="pfname"]'):
            # scrape each item from the link that was passed as an argument
            # and add it to the current professor
            professor["firstMiddleName"] = response.xpath('//h1[@class="profname"]/span[@class="pfname"][1]/text()').extract()
        if response.xpath('//*[@class="plname"]'):
            professor["lastName"] = response.xpath('//h1[@class="profname"]/span[@class="plname"]/text()').extract()
        if response.xpath('//*[@class="table-toggle rating-count active"]'):
            professor["numOfRatings"] = response.xpath('//div[@class="table-toggle rating-count active"]/text()').extract()
        if response.xpath('//*[@class="grade"]'):
            professor["overallQuality"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][1]/div[@class="grade"]/text()').extract()
        if response.xpath('//*[@class="grade"]'):
            professor["averageGrade"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][2]/div[@class="grade"]/text()').extract()
        return professor

# add string to rule. linkextractor only gets "/showratings.." not "ratemyprofessors.com/showratings"
My problem is in the scraper_spider.py file above. The spider is supposed to go to this RateMyProfessors page, visit each professor and collect their information, then return to the directory and get the next professor. Once there are no more professors to scrape on the page, it should find the href value of the Next button, go to that page, and repeat the same process.
My scraper is able to scrape all of the professors on page 1 of the directory, but it stops after that because it never goes to the next page.
Can you help my scraper successfully find and follow the next page?
I tried to follow this Stack Overflow question, but it was too specific to adapt.
If you want to use the rules attribute, your scraperSpider should inherit from CrawlSpider. See the documentation here. Also note this warning from the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
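In other words, the spider above never paginates because scrapy.Spider ignores rules, and even under CrawlSpider the rule's callback='parse' would break the crawler. A minimal sketch of the CrawlSpider variant, assuming the same Scrapy version as the question (the scrapy.contrib import paths); the parse_listing name is my own choice, and the XPaths are copied from the question:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from scraper.items import ScraperItem


class ScraperSpider(CrawlSpider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    # follow the "next" link on every directory page; the callback must
    # not be named 'parse', which CrawlSpider reserves for itself
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="nextLink"]'),
             callback='parse_listing', follow=True),
    )

    def parse_start_url(self, response):
        # rules only fire on pages the link extractor followed, so route
        # the first directory page through the same listing callback
        return self.parse_listing(response)

    def parse_listing(self, response):
        for profile in response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract():
            professor = ScraperItem()
            professor["profile"] = profile
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            yield request

    # parse_profile stays exactly as in the question

parse_start_url is the hook CrawlSpider calls for responses from start_urls; without it the professors on page 1 would never be scraped, because rule callbacks only run on pages reached through the rules.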
I solved my problem by ignoring the rules entirely and following the "Following links" section of this doc.
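For reference, that approach keeps scrapy.Spider and drops the rules entirely: at the end of parse, extract the next-page href yourself and yield a new request back into parse. A minimal sketch under the same assumptions (Python 2 era, hence urlparse; the nextLink XPath comes from the question's rule):

import urlparse

import scrapy
from scraper.items import ScraperItem


class ScraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    def parse(self, response):
        # queue a parse_profile request for every professor on this page
        for profile in response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract():
            professor = ScraperItem()
            professor["profile"] = profile
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            yield request

        # then follow the "next" link, if any, back into this same method
        next_page = response.xpath('//a[@class="nextLink"]/@href').extract()
        if next_page:
            yield scrapy.Request(urlparse.urljoin(response.url, next_page[0]),
                                 callback=self.parse)

    # parse_profile stays exactly as in the question

Because parse is a plain scrapy.Spider callback here, recursing into it is safe; the crawl simply ends on the last directory page, where no nextLink element exists.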