Scrapy & Splash not returning anything from javascript page
I am currently following the official documentation and a Youtube video to scrape a javascript page using Scrapy and its Splash js rendering service:
https://splash.readthedocs.io/en/stable/install.html
https://www.youtube.com/watch?v=VvFC93vAB7U
I installed Docker on my Mac and ran Splash as the official documentation instructs:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
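(As a quick sanity check that the container is reachable, Splash's render.html endpoint can be hit directly; the wait parameter gives the page's javascript time to execute:)

curl 'http://localhost:8050/render.html?url=http://quotes.toscrape.com/js&wait=0.5'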
Then I took this demo code from the Youtube video:
import scrapy
from scrapy_splash import SplashRequest

class Demo_js_pider(scrapy.Spider):
    name = 'jsdemo'

    def start_request(self):
        yield SplashRequest(
            url = 'http://quotes.toscrape.com/js',
            callback = self.parse,
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract.first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract(),
            }
This is run with 'scrapy crawl jsdemo' (I have scrapy installed in a local virtualenv (python 3.6.4) along with all the correct modules, including the scrapy-splash module).
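For completeness, scrapy-splash also requires its middlewares to be enabled in the project's settings.py; the minimal configuration from the scrapy-splash README, which is assumed to be in place here, looks like this:

# settings.py -- minimal scrapy-splash setup per its README
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'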
However, when it runs, nothing is returned other than the following output:
2018-05-11 12:42:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-11 12:42:27 [scrapy.core.engine] INFO: Spider opened
2018-05-11 12:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-11 12:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-11 12:42:27 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-11 12:42:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 11, 5, 42, 27, 552500),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'memusage/max': 49602560,
'memusage/startup': 49602560,
'start_time': datetime.datetime(2018, 5, 11, 5, 42, 27, 513940)}
2018-05-11 12:42:27 [scrapy.core.engine] INFO: Spider closed (finished)
The above is truncated; here is a link to the full output: https://pastebin.com/yQVp3n6z
I have tried this several times. I also tried running a basic html scraping spider from the main Scrapy tutorial, and that ran fine, so I'm guessing the error is somewhere in Splash?
I also noticed this in the output:
DEBUG: Telnet console listening on 127.0.0.1:6023
Is this correct? The docker command runs Splash's telnet on 5023, and I tried changing this to 6023, but nothing changed. I also tried setting TELNETCONSOLE_PORT in the settings to 5023 and 6023, which only throws these errors when I try to run scrapy crawl:
Traceback (most recent call last):
File "/Users/david/Documents/projects/cryptoinfluencers/env/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 170, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 198, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 203, in _create_crawler
return Crawler(spidercls, self.settings)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 55, in __init__
self.extensions = ExtensionManager.from_crawler(self)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/extensions/telnet.py", line 53, in from_crawler
return cls(crawler)
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/extensions/telnet.py", line 46, in __init__
self.portrange = [int(x) for x in crawler.settings.getlist('TELNETCONSOLE_PORT')]
File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/settings/__init__.py", line 182, in getlist
return list(value)
TypeError: 'int' object is not iterable
At this point I'm not sure what else needs to change...
You have a simple typo: start_request() vs start_requests().
You also have another typo: extract.first() should be extract_first().
Here is the working code:
import scrapy
from scrapy_splash import SplashRequest

class Demo_js_pider(scrapy.Spider):
    name = 'jsdemo'

    def start_requests(self):
        yield SplashRequest(
            url = 'http://quotes.toscrape.com/js',
            callback = self.parse,
        )

    def parse(self, response):
        print("Parsing...\n")
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract(),
            }
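As for the telnet messages: the console on 127.0.0.1:6023 is Scrapy's own debugging console and has nothing to do with Splash's telnet port 5023, so it can be left at its default. The TypeError you hit comes from TELNETCONSOLE_PORT needing to be an iterable port range rather than a bare int. If you do want to override it, a minimal sketch of the expected form (this is Scrapy's default range) is:

# settings.py
# TELNETCONSOLE_PORT must be an iterable; Scrapy binds the first free port in the range
TELNETCONSOLE_PORT = [6023, 6073]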