Scrapy crawl loop for next page
Hello, I'm experimenting with a scraper and crawler, but I don't understand why my code won't go to the next page and loop.
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['https://www.thehousedirectory.com/category/interior-designers-architects/london-interior-designers/']

    def parse(self, response):
        allbuyers = response.xpath('//div[@class="company-details"]')
        for buyers in allbuyers:
            name = buyers.xpath('.//div/a/h2/text()').extract_first()
            email = buyers.xpath('.//p/a[contains(text(),"@")]/text()').extract_first()
            yield {
                'Name': name,
                'Email': email,
            }
        next_url = response.css('#main > div > nav > a.next.page-numbers')
        if next_url:
            print("test")
            url = response.xpath("href").extract()
            yield scrapy.Request(url, self.parse)
What you are doing to get the next page doesn't make sense. Specifically, I mean this line: url = response.xpath("href").extract()
Here is a modified version of your spider:
import scrapy

class HouseDirectorySpider(scrapy.Spider):
    name = 'thehousedirectory'
    start_urls = ['https://www.thehousedirectory.com/category/interior-designers-architects/london-interior-designers/']

    def parse(self, response):
        for buyers in response.xpath('//*[@class="company-details"]'):
            yield {
                'Name': buyers.xpath('.//*[@class="heading"]/a/h2/text()').get(),
                'Email': buyers.xpath('.//p/a[starts-with(@href,"mailto:")]/text()').get(),
            }
        next_url = response.css('.custom-pagination > a.next:contains("Next Page")')
        if next_url:
            url = next_url.css("::attr(href)").get()
            yield scrapy.Request(url, callback=self.parse)
So with your code, if the next_url variable returns nothing, the loop simply stops. You can also do it this way:
next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
if next_url:
    yield scrapy.Request(next_url, self.parse)
This code also handles the case where the scraper cannot find the NEXT PAGE DOM element.