Scrapy pagination: Unable to paginate

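First of all, thank you for reading this.

I have been using Python and Scrapy to scrape small amounts of data. Now I want to extract some additional information, but I am stuck on pagination. The website is https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html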

The next-page element is:

<span class="jslink pg-btn page-next" data-href="https://home.mobile.de/regional/baden-württemberg/2.html" title="Zur nächsten Seite">&nbsp;</span>

What XPath expression can I use in Rule(LinkExtractor(restrict_xpaths=""))?

I am using the crawl template. My code so far:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Baden1Spider(CrawlSpider):
    name = 'baden1'
    allowed_domains = ['home.mobile.de']
    start_urls = ['https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html?fbclid=IwAR0MpRTx1TrrrBdg2cKr5E08QiP4fE-pjOAwb7_UsEytToJmWFEfpdD6X0w/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='box']/div[@class='row ']"), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths="//span[@class='jslink pg-btn page-next']"))
    )

    def parse_item(self, response):
        yield{
            'Dealer Name': response.xpath("//address[@class='fullAddress']/strong/text()").get(),
            'Street': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text())").get(),
            'ZIP Code': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[0],
            'City': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[1],
            'Phone Number 1': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text())").get(),
            'Phone Number 2': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text()/following::text()[1])").get(),
            'Source': response.url
        }

N.B. This is my first post on Stack Overflow. Please forgive me if I have made any mistakes.

Pagination can be done as follows:

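Your code works fine. The starting URL, "https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html", is the same one you mentioned: clicking the first page gives you this URL, and the data is generated from there. I did the pagination in start_urls using a list comprehension, so you can increase or decrease the page range at any time. Here I scraped only five pages; to scrape all pages or any other number, just change the range. In total I scraped 160 items from 5 pages.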
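As a small illustration of the list comprehension idea, the sketch below builds the listing-page URLs for a given number of pages. The helper name `listing_urls` and its `pages` parameter are hypothetical, introduced only for this example; the URL pattern is the one from the question.

```python
# URL prefix of the regional listing pages, taken from the question.
BASE = "https://home.mobile.de/regional/baden-w%C3%BCrttemberg/"


def listing_urls(pages):
    """Return the URLs of listing pages 0 .. pages-1 (hypothetical helper)."""
    return [f"{BASE}{page}.html" for page in range(pages)]


urls = listing_urls(5)
print(urls[0])   # first listing page
print(urls[-1])  # last listing page in the chosen range
```

Assigning the result of `listing_urls(5)` to `start_urls` would be equivalent to the comprehension used in the spider. The number of pages still has to be known in advance with this approach.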

Code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Baden1Spider(CrawlSpider):
    name = 'baden1'
    allowed_domains = ['home.mobile.de']
    # Listing pages 0-4; widen the range to crawl more pages
    start_urls = [f'https://home.mobile.de/regional/baden-w%C3%BCrttemberg/{x}.html' for x in range(0, 5)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='box']/div[@class='row ']"), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths="//span[@class='jslink pg-btn page-next']"))
    )

    def parse_item(self, response):
        yield{
            'Dealer Name': response.xpath("//address[@class='fullAddress']/strong/text()").get(),
            'Street': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text())").get(),
            'ZIP Code': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[0],
            'City': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[1],
            'Phone Number 1': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text())").get(),
            'Phone Number 2': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text()/following::text()[1])").get(),
            'Source': response.url
        }
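The commented-out rule in the spider most likely finds nothing because the next-page URL sits in a `data-href` attribute of a `span`, while `LinkExtractor` by default only looks at `href` attributes of `a` and `area` tags. The sketch below uses only the standard library to show where the URL lives; the class name `NextPageFinder` is hypothetical. In Scrapy, the `tags` and `attrs` arguments of `LinkExtractor` are the parameters to try, as noted in the closing comment; that rule is an assumption and has not been tested against the live site.

```python
from html.parser import HTMLParser

# The pagination element quoted in the question.
SNIPPET = (
    '<span class="jslink pg-btn page-next" '
    'data-href="https://home.mobile.de/regional/baden-württemberg/2.html" '
    'title="Zur nächsten Seite">&nbsp;</span>'
)


class NextPageFinder(HTMLParser):
    """Collect the data-href of the first <span> whose class list has 'page-next'."""

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "page-next" in classes and self.next_url is None:
            # The link target is stored in data-href, not in href.
            self.next_url = attributes.get("data-href")


finder = NextPageFinder()
finder.feed(SNIPPET)
print(finder.next_url)

# Untested assumption: in a CrawlSpider the analogous rule might look like
# Rule(LinkExtractor(restrict_xpaths="//span[contains(@class, 'page-next')]",
#                    tags=("span",), attrs=("data-href",)), follow=True)
```

Following the next button this way would remove the need to hard-code the page range, at the cost of depending on the site's markup staying the same.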

Output (part of the total output):

{'Dealer Name': 'Abbas KfZ An- und Verkauf', 'Street': 'schießstattweg 18', 'ZIP Code': '88677', 'City': 'Markdorf', 'Phone Number 1': 'Tel.:\xa0+49 (0)176 56730811', 'Phone Number 2': '', 'Source': 'https://home.mobile.de/ABBASKFZANUNDVERKAUF'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/SCHAIBLEMASCHINENHANDEL>
{'Dealer Name': 'Schaible Maschinenhandel', 'Street': 'In Oberwiesen 7', 'ZIP Code': '88682', 'City': 'Salem', 'Phone Number 1': 'Tel.:\xa0+49 (0)7553 60146', 'Phone Number 2': 'Mobiltelefon:\xa0+49 (0)171 7998515', 'Source': 'https://home.mobile.de/SCHAIBLEMASCHINENHANDEL'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/RUSH-AUTOMOBILE>
{'Dealer Name': 'RUSH Automobile UG (haftungsbeschränkt)', 'Street': 'Hallendorferstrasse 6', 'ZIP Code': '88690', 'City': 'Uhldingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7551 949277', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)171 3608800', 'Source': 'https://home.mobile.de/RUSH-AUTOMOBILE'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/FIRST-CLASS-AUTOMOBILE> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AH-MUTTER> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/LOCFAHRZEUGE>
{'Dealer Name': 'LOC Fahrzeuge OHG', 'Street': 'Meersburger Straße 2', 'ZIP Code': '88690', 'City': 'Uhldingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7556 928597', 'Phone Number 2': 'Fax:\xa0+49 (0)7556 928583', 'Source': 'https://home.mobile.de/LOCFAHRZEUGE'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AH-SCHMID-BERMATINGEN> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/FIRST-CLASS-AUTOMOBILE>
{'Dealer Name': 'First Class Automobile Seit 1989', 'Street': 'Büro: Oberer Höhenweg 29', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)176 20491640', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7544 91111', 'Source': 'https://home.mobile.de/FIRST-CLASS-AUTOMOBILE'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AH-MUTTER>
{'Dealer Name': 'Autohaus Matthias Mutter', 'Street': 'Salemerstrasse 42', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 912100', 'Phone Number 2': 'Fax:\xa0+49 (0)7544 91110', 'Source': 'https://home.mobile.de/AH-MUTTER'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AH-SCHMID-BERMATINGEN>
{'Dealer Name': 'Autohaus Schmid', 'Street': 'Salemer Straße 30', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 7544 2375', 'Phone Number 2': 'Fax:\xa0+49 7544 1355', 'Source': 'https://home.mobile.de/AH-SCHMID-BERMATINGEN'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/YAMAHA-NESENSOHN> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/YAMAHA-NESENSOHN>
{'Dealer Name': 'Yamaha Nesensohn', 'Street': 'Salemerstrasse 51', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 2902', 'Phone Number 2': 'Fax:\xa0+49 (0)7544 73025', 'Source': 'https://home.mobile.de/YAMAHA-NESENSOHN'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUS-KIRCHHOFF> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOMOBILEREHM> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUS-KIRCHHOFF>
{'Dealer Name': 'Autohaus Kirchhoff', 'Street': 'Am Luckengraben 4', 'ZIP Code': '88699', 'City': 'Frickingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7554 8450', 'Phone Number 2': 'Fax:\xa0+49 (0)7554 8252', 'Source': 'https://home.mobile.de/AUTOHAUS-KIRCHHOFF'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUSREICHLEOHG> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOMOBILEREHM>
{'Dealer Name': 'Automobile Rehm', 'Street': 'Heidbühlstr. 9', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 175 2234111', 'Phone Number 2': '', 'Source': 'https://home.mobile.de/AUTOMOBILEREHM'}       
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG>
{'Dealer Name': 'Autohaus Sailer GmbH & Co.KG', 'Street': 'Hofäckerstr. 1', 'ZIP Code': '88697', 'City': 'Bermatingen-Ahausen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 968300', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7544 9683018', 'Source': 'https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG'}
2021-08-06 12:40:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1>
{'Dealer Name': 'Patrick Kayser', 'Street': 'Langbrühl 6', 'ZIP Code': '88709', 'City': 'Hagnau', 'Phone Number 1': 'Tel.:\xa0+49 (0)178 6524858', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7532 4458081', 'Source': 'https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1'}
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUSREICHLEOHG>       
{'Dealer Name': 'Autohaus Reichle OHG', 'Street': 'Hauptstraße 57', 'ZIP Code': '88699', 'City': 'Frickingen-Altheim', 'Phone Number 1': 'Tel.:\xa0+49 7554 8337', 'Phone Number 2': 'Mobiltelefon:\xa0+49 151 65828855', 'Source': 'https://home.mobile.de/AUTOHAUSREICHLEOHG'}
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE>
{'Dealer Name': 'Lackiermeisterbetrieb & KFZ Service', 'Street': 'Lippertsreuterstr. 6b', 'ZIP Code': '88699', 'City': 'Frickingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7554 9892115', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)1525 2160629', 'Source': 'https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE'}
2021-08-06 12:40:15 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-06 12:40:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369317,
 'downloader/request_count': 165,
 'downloader/request_method_count/GET': 165,
 'downloader/response_bytes': 2468479,
 'downloader/response_count': 165,
 'downloader/response_status_count/200': 165,
 'elapsed_time_seconds': 17.246198,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 6, 6, 40, 15, 130449),
 'httpcompression/response_bytes': 6481573,
 'httpcompression/response_count': 165,
 'item_scraped_count': 160,