Scrapy with multiple pages
I've created a simple Scrapy project in which I grab the total number of pages from the start site example.com/full. Now I need to scrape every page, starting from example.com/page-2 up to 100 (if the total page count is 100). How can I do that?
Any advice would be helpful.
Code:
import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/full/']
    total_pages = 0

    def parse(self, response):
        total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
        # urls = ('https://example.com/page-{}'.format(i) for i in range(1, total_pages))
        print(total_pages)
Update #1:
I tried using urls = ('https://example.com/page-{}'.format(i) for i in range(1, total_pages))
but it doesn't work; I'm probably doing something wrong.
Update #2:
I've changed my code like this:
class AllSpider(scrapy.Spider):
    name = 'all'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']
    total_pages = 0

    def parse(self, response):
        total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
        for page in range(2, int(total_pages)):
            url = 'https://sanet.st/page-' + str(page)
            yield scrapy.Request(url)
            title = response.xpath('//*[@class="list_item_title"]/h2/a/span/text()').extract()
            print(title)
But the loop still just prints the first page's titles over and over.
I need to extract the titles from the different pages and print them to the console.
How can I do that?
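The requests yielded in the loop are downloaded later and handled by a callback; the title = response.xpath(...) line in your loop always reads from the already-downloaded first page's response, which is why the same titles repeat. Extract the data in a callback that receives each page's own response: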
from scrapy.http import Request

def parse(self, response):
    total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
    # extract_first() returns a string, so convert it before using it in range();
    # the + 1 makes the last page inclusive
    urls = ('https://example.com/page-{}'.format(i) for i in range(1, int(total_pages) + 1))
    for url in urls:
        yield Request(url, callback=self.parse_page)

def parse_page(self, response):
    # do the stuff: extract this page's titles here
    pass
Another approach, shown in the tutorial, is to use yield response.follow(url, callback=self.parse_page), which supports relative URLs directly.
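For example, a minimal sketch of that variant; because response.follow resolves URLs relative to response.url, the page paths can be passed without the domain:

def parse(self, response):
    total_pages = response.xpath("//body/section/div/section/div/div/ul/li[6]/a/text()").extract_first()
    for i in range(1, int(total_pages) + 1):
        # '/page-1', '/page-2', ... are resolved against response.url
        yield response.follow('/page-{}'.format(i), callback=self.parse_page)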
You have to look for the 'next_page' link and keep looping for as long as it is present on the page.
# -*- coding: utf-8 -*-
import scrapy

class SanetSpider(scrapy.Spider):
    name = 'sanet'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']

    def parse(self, response):
        yield {
            # Do something with the current page here.
            'result': response.xpath('//h3[@class="posts-results"]/text()').extract_first()
        }

        # next_page = /page-{}/ where {} is the page number.
        next_page = response.xpath('//a[@data-tip="Next page"]/@href').extract_first()

        # If a next-page link exists, turn it into an absolute URL
        # (https://sanet.st/page-{}/) and call parse() again on it.
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
If you run it with the -o sanet.json option, you get the result below.
scrapy runspider sanet.py -o sanet.json
[
{"result": "results 1 - 15 from 651"},
{"result": "results 16 - 30 from 651"},
{"result": "results 31 - 45 from 651"},
...
{"result": "results 631 - 645 from 651"},
{"result": "results 646 - 651 from 651"}
]
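Putting the two answers together for the original goal, here is a minimal sketch that yields the titles of every page and follows the next-page link until it disappears; the two XPath expressions are taken from the question and the answer above and are assumed to still match the site:

import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    allowed_domains = ['sanet.st']
    start_urls = ['https://sanet.st/full/']

    def parse(self, response):
        # Extract the titles from the page this callback actually received,
        # not from the first page's response as in Update #2.
        for title in response.xpath('//*[@class="list_item_title"]/h2/a/span/text()').extract():
            yield {'title': title}

        # Keep following the next-page link while one exists.
        next_page = response.xpath('//a[@data-tip="Next page"]/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)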