Extract data page by page using a single Scrapy spider
I am trying to extract data from goodreads.com. I want to scrape the pages one by one, with some time delay.
My spider looks like this:
import scrapy
import unidecode
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html


class ElementSpider(scrapy.Spider):
    name = 'books'
    download_delay = 3
    allowed_domains = ["https://www.goodreads.com"]
    start_urls = ["https://www.goodreads.com/list/show/19793.I_Marked_My_Calendar_For_This_Book_s_Release?page=1",
                  ]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="next_page"]',)), callback="parse", follow=True),)

    def parse(self, response):
        # follow every book link on the listing page
        for href in response.xpath('//div[@id="all_votes"]/table[@class="tableList js-dataTooltip"]/tr/td[2]/div[@class="js-tooltipTrigger tooltipTrigger"]/a/@href'):
            full_url = response.urljoin(href.extract())
            print full_url
            yield scrapy.Request(full_url, callback=self.parse_books)

        # build the request for the next listing page
        next_page = response.xpath('.//a[@class="button next"]/@href').extract()
        if next_page:
            next_href = next_page[0]
            print next_href
            next_page_url = 'https://www.goodreads.com' + next_href
            request = scrapy.Request(url=next_page_url)
            yield request

    def parse_books(self, response):
        yield {
            'url': response.url,
            'title': response.xpath('//div[@id="metacol"]/h1[@class="bookTitle"]/text()').extract(),
        }
Please suggest what I should do so that the spider extracts data from all the pages in a single run.
I modified my code and now it works. The change is that

    request = scrapy.Request(url=next_page_url)

should be

    request = scrapy.Request(next_page_url, self.parse)
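For context, a minimal sketch of how the pagination part of parse looks after that change (selectors and names taken from the question; the rest of the spider is assumed to stay as posted):

    def parse(self, response):
        # ... book links are yielded here exactly as before ...

        # next listing page; passing self.parse as the callback keeps the
        # pagination loop going instead of falling back to no callback at all
        next_page = response.xpath('.//a[@class="button next"]/@href').extract()
        if next_page:
            next_page_url = 'https://www.goodreads.com' + next_page[0]
            yield scrapy.Request(next_page_url, self.parse)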
It works well when I comment out allowed_domains = ["https://www.goodreads.com"]; otherwise no data gets saved to the JSON file.
Can anyone explain why?
It looks like allowed_domains needs a better explanation in the documentation, but if you check the examples there, the domain should have the form domain.com, so avoid the scheme and unnecessary subdomains (www is a subdomain).
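In other words, a setting of the form the documentation examples use would look like this (a sketch, assuming the spider otherwise stays as posted):

    # bare registered domain: no scheme, no www; subdomains are still allowed
    allowed_domains = ["goodreads.com"]

With the scheme-prefixed value, the offsite filtering never matches the hosts of the requests yielded from parse, so they are dropped before anything reaches parse_books, which would explain the empty JSON output from a run such as scrapy crawl books -o books.json (the output file name here is just an example).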