Scrapy crawl loop for next page
Hello, I'm experimenting with a scraper and crawler, but I don't understand why my code won't go to the next page and loop.
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['https://www.thehousedirectory.com/category/interior-designers-architects/london-interior-designers/']

    def parse(self, response):
        allbuyers = response.xpath('//div[@class="company-details"]')
        for buyers in allbuyers:
            name = buyers.xpath('.//div/a/h2/text()').extract_first()
            email = buyers.xpath('.//p/a[contains(text(),"@")]/text()').extract_first()
            yield {
                'Name': name,
                'Email': email,
            }
        next_url = response.css('#main > div > nav > a.next.page-numbers')
        if next_url:
            print("test")
            url = response.xpath("href").extract()
            yield scrapy.Request(url, self.parse)
What you are doing to get the next page doesn't make sense. Specifically, I mean this line: url = response.xpath("href").extract()
Here is a modified version of your spider:
import scrapy

class HouseDirectorySpider(scrapy.Spider):
    name = 'thehousedirectory'
    start_urls = ['https://www.thehousedirectory.com/category/interior-designers-architects/london-interior-designers/']

    def parse(self, response):
        for buyers in response.xpath('//*[@class="company-details"]'):
            yield {
                'Name': buyers.xpath('.//*[@class="heading"]/a/h2/text()').get(),
                'Email': buyers.xpath('.//p/a[starts-with(@href,"mailto:")]/text()').get(),
            }
        next_url = response.css('.custom-pagination > a.next:contains("Next Page")')
        if next_url:
            url = next_url.css("::attr(href)").get()
            yield scrapy.Request(url, callback=self.parse)
So with your code, if the next_url variable returns nothing, the loop simply stops. You can also do it this way:
next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
if next_url:
    yield scrapy.Request(next_url, self.parse)
This code also handles the case where the scraper cannot find the NEXT PAGE DOM element.