Scrapy - Scraping data from first page only, not from "Next" page in pagination
The Scrapy code below (taken from a blog post) works fine, but it only scrapes data from the first page. I added a "Rule" to extract data from the second page, yet it still only pulls data from the first page.
Any suggestions?
Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import TfawItem


class MasseffectSpider(CrawlSpider):
    name = "massEffect"
    allowed_domains = ["tfaw.com"]
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]
    rules = (
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=('//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        comic = TfawItem()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
There are a few issues with your spider. First, you are overriding the parse() method, which CrawlSpider reserves for its own logic. Per the documentation:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
The second issue is that your LinkExtractor is not extracting anything: the XPath you pass to restrict_xpaths matches nothing on the page, so no pagination links are ever followed.
I would recommend not using CrawlSpider at all here, and instead using the base scrapy.Spider, like this:
import scrapy


class MySpider(scrapy.Spider):
    name = 'massEffect'
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    def parse(self, response):
        # parse all items on the current listing page
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        # follow the "next page" link, if there is one
        next_page = response.xpath("//a[contains(text(),'next page')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail_page(self, response):
        comic = dict()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
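Note that response.urljoin matters here because the site's "next page" href may be relative rather than absolute; it resolves the link against the current page's URL before the new Request is made. The stdlib equivalent illustrates the behavior (the ?page=2 href below is hypothetical, just to show resolution):

```python
from urllib.parse import urljoin

# response.urljoin(href) behaves like urljoin(response.url, href):
# a relative or root-relative href is resolved against the page URL.
base = 'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time'
resolved = urljoin(base, '/Companies/Dark-Horse/Series/?page=2')
print(resolved)
# → http://www.tfaw.com/Companies/Dark-Horse/Series/?page=2
```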