Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?

I have the following scrapy CrawlSpider:

import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

logger = lg.get_logger("oddsportal_spider")


class SeleniumScraper(CrawlSpider):
    
    name = "splash"
    
    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
        },
    }

    httperror_allowed_codes = [301]
    
    start_urls = ["https://www.oddsportal.com/tennis/results/"]
    
    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
        ),
    )

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")
    
    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()

The Selenium middleware looks like this:

# scraper_scrapy/odds/middlewares.py
from pathlib import Path

import logger as lg
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

logger = lg.get_logger("oddsportal_spider")


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        logger.debug(f"Selenium processing request - {request.url}")
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=Path("/opt/geckodriver/geckodriver"),
        )

    def spider_closed(self, spider):
        self.driver.close()

Going through roughly 50 pages takes about a minute. To try to speed this up and take advantage of multithreading and JavaScript, I implemented the following scrapy_splash spider:

class SplashScraper(CrawlSpider):
    
    name = "splash"
    
    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    httperror_allowed_codes = [301]
    
    start_urls = ["https://www.oddsportal.com/tennis/results/"]
    
    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            process_request="use_splash",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
            process_request="use_splash",
        ),
    )

    def process_links(self, links): 
        for link in links: 
            link.url = "http://localhost:8050/render.html?" + urlencode({'url' : link.url}) 
        return links

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def use_splash(self, request, response):
        request.meta.update(splash={'endpoint': 'render.html'})
        return request

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")
    
    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")

However, this takes roughly the same amount of time. I was hoping to see a big speed improvement :(

I've tried different DOWNLOAD_DELAY settings, but that didn't make things any faster.

All the concurrency settings are left at their defaults.
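
For reference, these are the settings I mean (the values shown are Scrapy's documented defaults; I haven't overridden any of them in either spider):

# Concurrency-related Scrapy settings, at their default values
CONCURRENT_REQUESTS = 16            # total concurrent requests handled by the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # concurrent requests per target domain
DOWNLOAD_DELAY = 0                  # no artificial delay between requests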

Any ideas on if/how I'm going wrong?

Attempting an answer here without experience with the library.

It looks like Scrapy crawlers are single-threaded by themselves. To get multi-threaded behaviour you need to configure your application differently or write code that makes it behave multi-threaded. It sounds like you've already tried this, so it may not be news to you, but make sure you have configured the CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE settings.

https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
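
As a sketch only (the values are illustrative, not tuned, and I'm assuming the defaults are currently in effect), those settings can be raised in the spider's custom_settings:

# Sketch: bump the concurrency-related settings on the Splash spider
custom_settings = {
    "SPLASH_URL": "http://localhost:8050",
    "CONCURRENT_REQUESTS": 32,             # Scrapy default is 16
    "CONCURRENT_REQUESTS_PER_DOMAIN": 32,  # Scrapy default is 8
    "REACTOR_THREADPOOL_MAXSIZE": 20,      # Scrapy default is 10
}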

I can't imagine there's a lot of CPU work going on during the scrape, so I doubt it's a GIL issue.

Ruling out the GIL as an option, there are two possibilities here:

  1. Your crawler isn't actually multi-threaded. This may be because you're missing some setting or configuration that would make it so. i.e. you may have set the env variables correctly, but your crawler is written in a way that processes URL requests synchronously instead of submitting them to a queue.

To test this, create a global object and store a counter on it. Each time your crawler starts a request, increment the counter; each time it finishes a request, decrement it. Then run a thread that prints the counter every second. If the counter value is always 1, then you are still running synchronously.

# global_state.py

GLOBAL_STATE = {"counter": 0}

# middleware.py

from global_state import GLOBAL_STATE

class SeleniumMiddleware:

    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)
        GLOBAL_STATE["counter"] -= 1

        ...

# main.py

from global_state import GLOBAL_STATE
import threading
import time

def main():
  gst = threading.Thread(target=gs_watcher)
  gst.start()

  # Start your app here

def gs_watcher():
  while True:
    print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
    time.sleep(1)

  2. The site you are crawling is rate limiting you.

To test this, run the application multiple times. If each app drops from 50 req/s to 25 req/s, then you are being rate limited. To get around that, use a VPN to hop around.
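
A rough way to get that req/s number (a sketch only; it just reads Scrapy's built-in stats collector after the crawl finishes, and assumes the SplashScraper class from the question):

from scrapy.crawler import CrawlerProcess

# Sketch: report average responses/second from the crawl stats
process = CrawlerProcess()
crawler = process.create_crawler(SplashScraper)
process.crawl(crawler)
process.start()

stats = crawler.stats.get_stats()
elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
print(f"{stats['response_received_count'] / elapsed:.1f} responses/sec")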


If after that you find that you are running concurrent requests and you are not being rate limited, then something funky is going on in the library. Try removing chunks of code until you get down to the bare minimum needed to crawl. If you've reached the absolute minimal implementation and it's still slow, you now have a minimal reproducible example and can get much better/more informed help.