Scrapy css selector re gives broken json string
Hey, I'm new to Python, and to Scrapy in particular, and I'm trying to scrape Walmart. But I've run into a problem. This is the regex I use to extract a JSON string from the response:
__WML_REDUX_INITIAL_STATE__ =*(.*\});\};
But it sometimes gives a broken JSON string, e.g. for this Walmart product, due to which json.loads fails. Is this a problem with the regex or with Scrapy? I don't understand why this happens.
Scrapy/Parsel's Selector methods
.re()
and .re_first()
have the (unfortunate) default behavior of replacing HTML character entity references.
This is what makes JSON decoding fail.
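The effect can be reproduced with the standard library alone. For this case, Parsel's entity replacement behaves like html.unescape: an &quot; inside an embedded JSON string becomes a literal double quote, which breaks the syntax. (The JSON fragment below is a made-up stand-in, not Walmart's actual data.)

```python
import html
import json

# A valid JSON document whose string value contains an HTML entity,
# as often found inside <script> tags
raw = '{"size": "21&quot; Inseam"}'
json.loads(raw)  # parses fine: {'size': '21&quot; Inseam'}

# Entity replacement (what .re()/.re_first() do by default) turns
# &quot; into a literal double quote inside the string value...
decoded = html.unescape(raw)  # '{"size": "21" Inseam"}'

# ...which is no longer valid JSON
try:
    json.loads(decoded)
except json.JSONDecodeError as exc:
    print("broken:", exc)
```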
Illustrating in scrapy shell with the example URL: your regex does work, and it selects the data you want:
$ scrapy shell https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527 -s USER_AGENT='mozilla'
2017-07-13 15:24:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(..)
2017-07-13 15:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527> (referer: None)
>>> data = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};')
>>> data[:25], data[-25:]
(' {"uuid":null,"isMobile":', 'nabled":true,"seller":{}}')
But decoding this string as JSON fails:
>>> import json
>>> json.loads(data)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 40598 (char 40597)
>>> data[40500:40650]
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul> <li>21" Inseam</li> <li>Rib knit waist with button and zippe'
The double quote is causing the trouble: the &quot; entity from the HTML source has been replaced with a literal ".
You can use the replace_entities=False
argument to leave entities unreplaced:
>>> dataraw = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};', replace_entities=False)
>>> dataraw[40500:40650]
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul> <li>21&quot; Inseam</li> <li>Rib knit waist with button and '
Notice how the &quot;
entity is kept as-is.
Now you can decode the string as JSON:
>>> d = json.loads(dataraw)
>>> d.keys()
dict_keys(['uuid', 'isMobile', 'isBot', 'isAdsEnabled', 'isEsiEnabled', 'isInitialStateDeferred', 'isServiceWorkerEnabled', 'isShellRequest', 'productId', 'product', 'showTrustModal', 'productBasicInfo', 'fulfillmentOptions', 'feedback', 'backLink', 'offersOrder', 'sellersHeading', 'fdaCompliance', 'recommendationMap', 'header', 'footer', 'addToRegistry', 'addToList', 'ads', 'btvMap', 'postQuestion', 'autoPartFinder', 'getPromoStatus', 'discoveryModule', 'lastAction', 'isAjaxCall', 'accessModeEnabled', 'seller'])
>>>
replace_entities
was introduced in parsel v1.2.0. (See https://github.com/scrapy/parsel/pull/88)
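If you are stuck on an older parsel without that parameter, a workaround is to extract the raw script text first (e.g. with .extract_first()) and apply the regex yourself with the stdlib re module, so no entity replacement ever happens. A minimal sketch, using a made-up state string in place of the real response:

```python
import json
import re

# Hypothetical stand-in for response.xpath('//script/text()').extract_first()
script_text = 'window.__WML_REDUX_INITIAL_STATE__ = {"size": "21&quot; Inseam"};};'

# Same regex as above, applied directly -- entities are left untouched
m = re.search(r'__WML_REDUX_INITIAL_STATE__ =*(.*\});\};', script_text)
if m:
    state = json.loads(m.group(1))
    print(state)  # {'size': '21&quot; Inseam'}
```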