Scrapy

Question

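This is the sitemap of the website I'm scraping. The 3rd and 4th <sitemap> nodes contain the URLs that lead to the item details. Is there any way to apply the crawling logic only to those nodes (for example, by selecting them by index)?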

from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):

    name = 'myspider'

    sitemap_urls = [
        'https://www.dfimoveis.com.br/sitemap_index.xml',
    ]

    sitemap_rules = [
        ('/somehow targeting the 3rd and 4th node', 'parse_item')
    ]

    def parse_item(self, response):
        # scraping the item
        pass

Answer 1

Scrapy's Spider subclasses, including SitemapSpider, are meant to make very common scenarios very easy.

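You want to do something rather uncommon, so you should read the source code of SitemapSpider, try to understand what it does, and then either subclass SitemapSpider and override the behavior you want to change, or write your own spider from scratch based on SitemapSpider's code.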

Answer 2

You don't need to use SitemapSpider; just use a regex and a standard spider.

import re

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        sitemap = 'https://www.dfimoveis.com.br/sitemap_index.xml'
        yield scrapy.Request(url=sitemap, callback=self.parse_sitemap)

    def parse_sitemap(self, response):
        sitemap_links = re.findall(r"<loc>(.*?)</loc>", response.text, re.DOTALL)
        sitemap_links = sitemap_links[2:4]  # Only 3rd and 4th nodes.
        for sitemap_link in sitemap_links:
            # self.parse still has to read the URLs listed in each child sitemap.
            yield scrapy.Request(url=sitemap_link, callback=self.parse)
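The regex is fine for a well-behaved index, but it can trip over whitespace or CDATA inside <loc>. If that becomes a problem, the same index-based selection can be sketched with the standard library's XML parser. The helper name pick_sitemap_locs is hypothetical, and the sketch assumes the index uses the standard sitemaps.org namespace, which I have not checked for this particular site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be in use).
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def pick_sitemap_locs(xml_body, start=2, stop=4):
    """Return the <loc> URLs of the <sitemap> nodes at positions [start, stop)."""
    root = ET.fromstring(xml_body)
    # Direct <sitemap> children of <sitemapindex>, in document order.
    nodes = root.findall('sm:sitemap', SITEMAP_NS)
    locs = []
    for node in nodes[start:stop]:
        loc = node.findtext('sm:loc', default='', namespaces=SITEMAP_NS).strip()
        if loc:  # Skip nodes that have no usable <loc>.
            locs.append(loc)
    return locs
```

Inside parse_sitemap you would call pick_sitemap_locs(response.body) and yield one scrapy.Request per returned URL, exactly as the regex version does with its sliced list.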

Scrapy - Selecting and crawling a specific type of sitemap nodes

python

xml

sitemap

web-crawler