Why does scrapy_splash CrawlSpider take the same amount of time as scrapy with Selenium?
I have the following scrapy CrawlSpider:
import logger as lg
from scrapy.crawler import CrawlerProcess
from scrapy.http import Response
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashTextResponse
from urllib.parse import urlencode
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

logger = lg.get_logger("oddsportal_spider")


class SeleniumScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543,
        },
    }

    httperror_allowed_codes = [301]

    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
        ),
    )

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(SeleniumScraper)
process.start()
The Selenium middleware is as follows:
from pathlib import Path

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        logger.debug(f"Selenium processing request - {request.url}")
        # Fetch the page in the (single, shared) browser and hand the rendered
        # HTML back to Scrapy in place of a normal download.
        self.driver.get(request.url)
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=Path("/opt/geckodriver/geckodriver"),
        )

    def spider_closed(self, spider):
        self.driver.close()
Reading through roughly 50 pages takes about a minute. To try to speed things up, and to take advantage of multiple threads and JavaScript, I have implemented the following scrapy_splash spider:
class SplashScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPIDER_MIDDLEWARES": {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100},
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    httperror_allowed_codes = [301]

    start_urls = ["https://www.oddsportal.com/tennis/results/"]

    rules = (
        Rule(
            LinkExtractor(allow="/atp-buenos-aires/results/"),
            callback="parse_tournament",
            process_request="use_splash",
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow="/tennis/",
                restrict_xpaths=("//td[@class='name table-participant']//a"),
            ),
            callback="parse_match",
            process_request="use_splash",
        ),
    )

    def process_links(self, links):
        for link in links:
            link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
        return links

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def use_splash(self, request, response):
        request.meta.update(splash={'endpoint': 'render.html'})
        return request

    def parse_tournament(self, response: Response):
        logger.info(f"Parsing tournament - {response.url}")

    def parse_match(self, response: Response):
        logger.info(f"Parsing match - {response.url}")
However, this takes about the same amount of time. I was hoping to see a big speed improvement :(
I've tried playing with different DOWNLOAD_DELAY settings, but that hasn't made things any faster.
All of the concurrency settings are left at their defaults.
Any ideas on if/how I'm going wrong?
Attempting an answer here without any experience with the library in question.
It looks like Scrapy crawlers are single-threaded by themselves. To get multi-threaded behaviour you need to configure your application differently or write code that makes it behave that way. It sounds like you have already tried this, so this is probably not news to you, but make sure you have configured CONCURRENT_REQUESTS and REACTOR_THREADPOOL_MAXSIZE.
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=thread#reactor-threadpool-maxsize
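For example, a minimal sketch of what that could look like in the SplashScraper's custom_settings (the values are illustrative, not tuned; the defaults noted in the comments are Scrapy's documented defaults):

class SplashScraper(CrawlSpider):
    name = "splash"

    custom_settings = {
        # Total number of requests Scrapy keeps in flight at once (default 16).
        "CONCURRENT_REQUESTS": 32,
        # The per-domain cap (default 8) binds first when crawling a single site.
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
        # Size of the Twisted reactor thread pool used for DNS lookups and other
        # blocking calls (default 10).
        "REACTOR_THREADPOOL_MAXSIZE": 20,
        # ... plus the existing USER_AGENT, SPLASH_URL and middleware settings ...
    }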
I can't imagine there is much CPU work going on during the scrape, so I doubt it's a GIL issue.
Ruling out the GIL as an option, there are two possibilities here:
- Your crawler is not actually multi-threaded. This may be because you are missing some setting or configuration that would make it so. i.e. you may have set the env variables correctly, but your crawler is written in a way that processes URL requests synchronously instead of submitting them to a queue.
To test this, create a global object and store a counter on it. Each time your crawler starts a request, increment the counter. Each time your crawler finishes a request, decrement it. Then run a thread that prints the counter every second. If your counter value is always 1, then you are still running synchronously.
# global_state.py
GLOBAL_STATE = {"counter": 0}


# middleware.py
from global_state import GLOBAL_STATE


class SeleniumMiddleware:
    def process_request(self, request, spider):
        GLOBAL_STATE["counter"] += 1
        self.driver.get(request.url)
        GLOBAL_STATE["counter"] -= 1
        ...


# main.py
from global_state import GLOBAL_STATE
import threading
import time


def main():
    gst = threading.Thread(target=gs_watcher)
    gst.start()

    # Start your app here


def gs_watcher():
    while True:
        print(f"Concurrent requests: {GLOBAL_STATE['counter']}")
        time.sleep(1)
- The site you are scraping is rate limiting you.
To test this, run the application multiple times. If each instance drops from 50 req/s to 25 req/s, then you are being rate limited. To get around this, use a VPN to hop around. A sketch for measuring each run's throughput follows below.
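As a rough way to put a number on each run, here is a minimal sketch of a Scrapy extension (ThroughputLogger is a hypothetical name, not part of your project) that logs responses per second when the spider closes; start two crawls at the same time and compare the figures:

# throughput.py - hypothetical helper extension
import time

from scrapy import signals


class ThroughputLogger:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        ext.stats = crawler.stats
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        self.start = time.time()

    def spider_closed(self, spider):
        elapsed = time.time() - self.start
        # "response_received_count" is recorded by Scrapy's built-in core stats
        # extension for every response that reaches the engine.
        responses = self.stats.get_value("response_received_count", 0)
        rate = responses / elapsed if elapsed else 0.0
        spider.logger.info(f"Throughput: {rate:.2f} responses/s over {elapsed:.1f}s")

Enable it with something like "EXTENSIONS": {"throughput.ThroughputLogger": 500} in custom_settings. If two simultaneous runs each report roughly half the rate of a single run, rate limiting is the likely culprit.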
If after that you find that you are running concurrent requests and you are not being rate limited, then something funky is going on in the library. Try removing chunks of code until you get down to the bare minimum of what you need to crawl. If you have gotten to the absolute bare-minimum implementation and it is still slow, then you now have a minimal reproducible example and can get much better/informed help.