Scrapy HAL+JSON: unsupported response type
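I'm trying to scrape a link that, according to Firefox and Safari, is HAL+JSON, and it returns a response object that Scrapy does not recognise.
The link is https://catalogue.presto.com.au/. It opens fine in Chrome, which displays the JSON in the browser, but Firefox and Safari download it as a file. I suspect Scrapy is not scraping the content because the link triggers a file download.
Has anyone faced a similar situation, or does anyone have a solution?
Accessing through the shell
When I access the site from the terminal with "scrapy shell https://catalogue.presto.com.au", I get: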
"2015-03-15 00:15:08+0700 [default] DEBUG: Crawled (200) <GET https://catalogue.presto.com.au>"
I then try view(response) and receive this error:
>>> view(response)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
    response.__class__.__name__)
TypeError: Unsupported response type: Response
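Running the Scrapy spider:
from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Python 2 print statement; shows which response class Scrapy chose
    print response.__class__
    open_in_browser(response)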
2015-03-15 00:23:05+0700 [prestotv2] DEBUG: Crawled (200) <GET https://catalogue.presto.com.au/> (referer: None)
<class 'scrapy.http.response.Response'> #this line is from "print response.__class__"
2015-03-15 00:23:05+0700 [prestotv2] ERROR: Spider error processing <GET https://catalogue.presto.com.au/>
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/nathansu/Documents/Development/Whutstream/scraping/Presto/presto/spiders/TvSpider.py", line 38, in parse
    open_in_browser(response)
  File "/Library/Python/2.7/site-packages/scrapy/utils/response.py", line 86, in open_in_browser
    response.__class__.__name__)
exceptions.TypeError: Unsupported response type: Response
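This happens because the response Content-Type is application/hal+json. Scrapy does not treat that type as text or HTML, so you get a plain Response object, which view() / open_in_browser() rejects. If you want to parse the body, load it with json.loads() (or use one of the libraries listed here):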
$ scrapy shell https://catalogue.presto.com.au/
In [1]: response.headers
Out[1]:
{'Age': '0',
'Cache-Control': 'max-age=300, public, s-maxage=300',
'Content-Type': 'application/hal+json', # HERE
'Date': 'Sat, 14 Mar 2015 17:42:45 GMT',
'Etag': '"834550fbc4b5fc5a188bd801c45876b7613b998b"',
'Expires': 'Sat, 14 Mar 2015 17:47:45 GMT',
'Last-Modified': 'Sat, 14 Mar 2015 17:42:45 GMT',
'Server': 'Apache/2.2.3 (Red Hat)',
'Vary': 'Accept,Accept-Encoding',
'Via': '1.1 varnish',
'X-Powered-By': 'PHP/5.4.15',
'X-Varnish': '905097089'}
In [2]: import json
In [3]: json.loads(response.body)
Out[3]:
{u'_links': {u'curies': [{u'href': u'/rels/{rel}',
u'name': u'ooyala',
u'templated': True}],
...
{window?}&size={size?}&discovery_profile_id={discovery_profile_id?}&exclude_videos={exclude_videos?}&offer_type={offer_type}',
u'templated': True,
u'title': u'Trending series'},
u'self': {u'href': u'/'}},
u'version': u'1.6.0.1'}
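The same idea can be sketched outside the shell. The snippet below is a minimal illustration, not part of the original answer: the helper names is_json_content_type and parse_hal are hypothetical, and the sample body is made up to resemble the output above. It checks whether a Content-Type value looks like JSON (including "+json" types such as application/hal+json) and then decodes the raw body with json.loads().

```python
import json


def is_json_content_type(content_type):
    """Return True for application/json and "+json" types such as application/hal+json."""
    # Header values may arrive as bytes; decode them before comparing.
    if isinstance(content_type, bytes):
        content_type = content_type.decode("latin-1")
    # Drop parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


def parse_hal(body, encoding="utf-8"):
    """Decode a raw response body (bytes) and return (document, links)."""
    document = json.loads(body.decode(encoding))
    # HAL keeps its hypermedia links under the reserved "_links" key.
    links = document.get("_links", {})
    return document, links


# Hypothetical sample body, shaped like the shell output above.
sample_body = b'{"_links": {"self": {"href": "/"}}, "version": "1.6.0.1"}'

if is_json_content_type("application/hal+json"):
    document, links = parse_hal(sample_body)
    print(document["version"])
    print(links["self"]["href"])
```

In a spider callback, something like parse_hal(response.body) could take the place of open_in_browser(response), assuming the body is UTF-8 encoded JSON.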