How to efficiently crawl a website using Scrapy

Question

I am trying to scrape a real-estate website using Scrapy and PyCharm, and failing miserably.

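Desired result:

Scrape 1 base URL (https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/), but 5 different internal URLs (https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/{i}-r/), where {i} = 1,2,3,4,5
Crawl each internal URL, or start from the base URL
Get all the href links, follow each one, and extract the data from each linked page.
Scrape around 5,000-7,000 unique listings as efficiently and quickly as possible.
Output the data to a CSV file while preserving Cyrillic characters.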

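Note: I have tried scraping with BeautifulSoup, but it took about 1-2 minutes per listing, and about 2-3 hours to scrape all listings using a for loop. A community member told me that Scrapy is a faster option. I'm not sure whether that is because of its data pipelines or whether I can do multi-threading.

Any help is greatly appreciated. ^^

Website sample HTML snippet: This is a picture of the HTML I am trying to scrape.

Current Scrapy code: This is what I have so far. When I run scrapy crawl unegui_apts, I cannot seem to get the results I want. I'm lost.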

# -*- coding: utf-8 -*-

# Import library
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
        ]
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def parse(self, response):
        self.logger.debug('callback "parse": got response %r' % response)
        cards = response.xpath('//div[@class="list-announcement-block"]')
        for card in cards:
            name = card.xpath('.//meta[@itemprop="name"]/text()').extract_first()
            price = card.xpath('.//meta[@itemprop="price"]/text()').extract_first()
            city = card.xpath('.//meta[@itemprop="areaServed"]/text()').extract_first()
            date = card.xpath('.//*[@class="announcement-block__date"]/text()').extract_first().strip().split(', ')[0]

            request = Request(link, callback=self.parse_details, meta={'name': name,
                                                                       'price': price,
                                                                       'city': city,
                                                                       'date': date})
            yield request

        next_url = response.xpath('//li[@class="pager-next"]/a/@href').get()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)

    # main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

Answer 1

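Your code has a number of issues:

The start_urls list contains an invalid link: the comma sits inside the first string literal, so the two literals are concatenated into a single malformed URL
You define a user-agent string in the headers dictionary, but you never use it when yielding requests
Your xpath selectors are incorrect
The next_url selector is incorrect, so no new request is yielded for the next page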
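To make the first point concrete, here is a small sketch in plain Python (no Scrapy needed) showing how the misplaced comma collapses two URLs into one. The names BASE, broken_start_urls, fixed_start_urls and paged_start_urls are only illustrative; the last one is one possible way to build the five numbered pages mentioned in the question.

```python
BASE = 'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'

# The comma is INSIDE the first string, so Python joins the two adjacent
# literals into one long (invalid) URL.
broken_start_urls = [
    'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
    'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
]

# With the comma outside the quotes the list holds two separate URLs.
fixed_start_urls = [
    'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/',
    'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/',
]

# Illustrative alternative: generate pages 1..5 instead of typing them out.
paged_start_urls = [f'{BASE}{i}-r/' for i in range(1, 6)]

print(len(broken_start_urls))  # 1
print(len(fixed_start_urls))   # 2
print(len(paged_start_urls))   # 5
```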

I have updated your code to fix the issues above, as follows:

import scrapy
from scrapy.crawler import CrawlerProcess

# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}},
                       'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]

    def parse(self, response):
        cards = response.xpath(
            '//li[contains(@class,"announcement-container")]')
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
            date = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first()
            city = card.xpath(".//*[@itemprop='areaServed']/@content").extract_first()

            yield {'name': name,
                   'price': price,
                   'city': city,
                   'date': date}

        next_url = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver: run the spider as a standalone script
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

Run the spider above with the command python <filename.py>, since you are running a standalone script rather than a full Scrapy project.
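Because a standalone script has no settings.py, any speed tuning would go into custom_settings (or into the settings passed to CrawlerProcess). Scrapy issues requests asynchronously rather than with threads, so concurrency is controlled through settings. The sketch below is only a starting point: the setting names are real Scrapy settings, but the values and the names CRAWL_SETTINGS and custom_settings are assumptions you would need to adjust to what the site tolerates.

```python
# Hypothetical tuning values, not recommendations for this particular site.
CRAWL_SETTINGS = {
    'CONCURRENT_REQUESTS': 16,            # total requests in flight
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # requests in flight per domain
    'DOWNLOAD_DELAY': 0.25,               # seconds between requests to a domain
    'AUTOTHROTTLE_ENABLED': True,         # back off when the server slows down
    'FEED_EXPORT_ENCODING': 'utf-8',      # keeps Cyrillic text in the CSV
}

# Merge with the feed configuration already used in the spider above.
custom_settings = {
    'FEEDS': {'results1.csv': {'format': 'csv'}},
    **CRAWL_SETTINGS,
}

print(sorted(custom_settings))
```

If the CSV is opened in Excel and the Cyrillic text looks garbled, trying 'utf-8-sig' as the encoding may help, though I have not checked that against this site's data.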

A sample of the CSV results is shown in the image below. You will need to clean the data using pipelines and the scrapy item class. See the docs for details.
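As a rough idea of what such a cleaning step could look like, here is a minimal pipeline sketch. The class name CleanListingPipeline and the cleaning rules (collapsing whitespace, turning the price into a number) are assumptions for illustration; only the process_item(self, item, spider) method signature is the interface Scrapy expects from a pipeline.

```python
class CleanListingPipeline:
    """Hypothetical item pipeline that tidies the fields yielded by the spider."""

    def process_item(self, item, spider):
        # Collapse runs of whitespace in the text fields.
        for key in ('name', 'city', 'date'):
            value = item.get(key)
            if isinstance(value, str):
                item[key] = ' '.join(value.split())

        # Assumed price format: digits, optionally with thousands separators.
        price = item.get('price')
        if isinstance(price, str):
            try:
                item['price'] = float(price.replace(',', '').strip())
            except ValueError:
                pass  # leave the raw string in place if it is not numeric
        return item


# Quick check outside Scrapy, using a plain dict as the item.
sample = {'name': '  2 өрөө   байр ', 'price': ' 120000000 ',
          'city': 'УБ', 'date': None}
cleaned = CleanListingPipeline().process_item(sample, spider=None)
print(cleaned)
```

In a standalone script the pipeline would presumably be enabled by adding something like 'ITEM_PIPELINES': {'__main__.CleanListingPipeline': 300} to custom_settings; in a full project the path would point at your pipelines module instead.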
