scrapy-splash active content selector works in shell but not with spider
I have just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:
$ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'
...
In [1]: response.css('div.booking::text').extract()
Out[1]:
['Booked 59 times today',
'Booked 20 times today',
'Booked 17 times today',
'Booked 29 times today',
'Booked 29 times today',
...
]
However, this simple spider returns an empty list:
import scrapy
from scrapy_splash import SplashRequest


class TableSpider(scrapy.Spider):
    name = 'opentable'
    start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

    def start_requests(self):
        # Send each start URL as a SplashRequest so the page can be rendered with JavaScript.
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 1.5},
            )

    def parse(self, response):
        yield {'bookings': response.css('div.booking::text').extract()}
When invoked:
$ scrapy crawl opentable
...
DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': []}
I have already tried
docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
and increasing the wait time.
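For reference, the shell command above talks to Splash's HTTP API directly. Below is a minimal sketch of building the same `render.html` URL in Python, with the target URL percent-encoded so that a query string of its own is not mistaken for Splash arguments. `splash_render_url` is a hypothetical helper name, and the host and argument values are simply the ones used in the command above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050", **args):
    """Build a Splash render.html URL for target_url (hypothetical helper)."""
    params = {"url": target_url}
    params.update(args)
    # urlencode percent-encodes the target URL, so any '?' or '&' inside it
    # stays part of the 'url' argument instead of becoming a Splash argument.
    return f"{splash_base}/render.html?{urlencode(params)}"


if __name__ == "__main__":
    print(splash_render_url(
        "https://www.opentable.com/new-york-restaurant-listings",
        timeout=10,
        wait=0.5,
    ))
```

Passing the printed URL to `scrapy shell` should be equivalent to the command above, which makes it a convenient way to compare what Splash returns with what the spider receives.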
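This does not work because the page content is generated with JavaScript.
There are a couple of solutions you can adopt:
1) Use Selenium.
2) If you look at the API the page uses, calling this URL <GET https://www.opentable.com/injector/stats/v1/restaurants/<restaurant_id>/reservations>
returns the current number of bookings for that specific restaurant (restaurant_id).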
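Option 2 could be sketched roughly as follows, assuming the endpoint above is still reachable and returns JSON (the response format is not shown here, so that part is a guess). `reservations_url` and `fetch_reservations` are hypothetical helper names, and the restaurant ID has to be supplied by the caller.

```python
import json
from urllib.request import urlopen

# URL pattern taken from the answer; restaurant_id is filled in per restaurant.
STATS_URL = (
    "https://www.opentable.com/injector/stats/v1/"
    "restaurants/{restaurant_id}/reservations"
)


def reservations_url(restaurant_id):
    """Return the stats URL for one restaurant (hypothetical helper)."""
    return STATS_URL.format(restaurant_id=restaurant_id)


def fetch_reservations(restaurant_id, timeout=10):
    """Fetch and decode the stats response.

    Assumption: the body is JSON. Adjust the parsing if it is not.
    """
    with urlopen(reservations_url(restaurant_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Inside a Scrapy project the same URL could be passed to a plain `scrapy.Request`, since no JavaScript rendering is needed for an API call.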
I think your problem is with the middlewares. First, you need to add some settings:
# settings.py

# uncomment `DOWNLOADER_MIDDLEWARES` and add these entries to it
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# URL of the Splash server
SPLASH_URL = 'http://localhost:8050'

# Splash-aware duplicate filter and HTTP cache storage
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
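As far as I understand, when `SplashMiddleware` is not enabled a `SplashRequest` is fetched straight from the target site rather than through Splash, which would explain the empty list. A small sketch for checking a plain settings dict against the entries listed above; `missing_splash_settings` is a hypothetical helper and not part of scrapy-splash.

```python
# Entries taken from the settings shown above.
REQUIRED_MIDDLEWARES = (
    "scrapy_splash.SplashCookiesMiddleware",
    "scrapy_splash.SplashMiddleware",
)
REQUIRED_KEYS = ("SPLASH_URL", "DUPEFILTER_CLASS")


def missing_splash_settings(settings):
    """Return the names of Splash-related settings absent from a dict."""
    # Top-level keys that are unset or empty.
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    # Middlewares that are not registered (or are disabled with None).
    middlewares = settings.get("DOWNLOADER_MIDDLEWARES") or {}
    missing += [m for m in REQUIRED_MIDDLEWARES if middlewares.get(m) is None]
    return missing


if __name__ == "__main__":
    # An empty project configuration reports every required entry.
    print(missing_splash_settings({}))
```

An empty list means all the entries from this answer are present; it does not prove that the Splash server itself is reachable.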
Now run Docker:
sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
After completing all these steps, I get:
scrapy crawl opentable
...
2018-06-23 11:23:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opentable.com/new-york-restaurant-listings via http://localhost:8050/render.html> (referer: None)
2018-06-23 11:23:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': [
'Booked 44 times today',
'Booked 24 times today',
'and many others Booked values'
]}