Scrapy css selector re gives broken json string
Hey, I'm new to Python, and to Scrapy in particular, and I'm trying to scrape Walmart. But I've run into a problem. This is the regex I use to extract a JSON string from the response:
__WML_REDUX_INITIAL_STATE__ =*(.*\});\};
But it sometimes gives a broken JSON string, e.g. for this Walmart product, due to which json.loads fails. Is this a problem with the regex or with Scrapy? I don't understand why this happens.
Scrapy/Parsel's Selector methods
.re()
and .re_first()
have the (unfortunate) default behavior of replacing HTML character entity references.
This is what makes JSON decoding fail.
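The effect can be reproduced with the standard library alone. For this case, Parsel's entity replacement behaves like html.unescape: an &quot; inside an embedded JSON string becomes a literal double quote, which breaks the syntax. (The JSON fragment below is a made-up stand-in, not Walmart's actual data.)

```python
import html
import json

# A valid JSON document whose string value contains an HTML entity,
# as often found inside <script> tags
raw = '{"size": "21&quot; Inseam"}'
json.loads(raw)  # parses fine: {'size': '21&quot; Inseam'}

# Entity replacement (what .re()/.re_first() do by default) turns
# &quot; into a literal double quote inside the string value...
decoded = html.unescape(raw)  # '{"size": "21" Inseam"}'

# ...which is no longer valid JSON
try:
    json.loads(decoded)
except json.JSONDecodeError as exc:
    print("broken:", exc)
```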
Illustrating in scrapy shell with the example URL: your regex does work, and it selects the data you want:
$ scrapy shell https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527 -s USER_AGENT='mozilla'
2017-07-13 15:24:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(..)
2017-07-13 15:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.walmart.com/ip/Riders-by-Lee-Women-s-On-the-Go-Performance-Capri/145227527> (referer: None)
>>> data = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};')
>>> data[:25], data[-25:]
(' {"uuid":null,"isMobile":', 'nabled":true,"seller":{}}')
But decoding this string as JSON fails:
>>> import json
>>> json.loads(data)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 40598 (char 40597)
>>> data[40500:40650]
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul> <li>21" Inseam</li> <li>Rib knit waist with button and zippe'
The double quote is causing the trouble: the &quot; entity from the HTML source has been replaced with a literal ".
You can use the replace_entities=False
argument to leave entities unreplaced:
>>> dataraw = response.xpath('//script/text()').re_first('__WML_REDUX_INITIAL_STATE__ =*(.*\});\};', replace_entities=False)
>>> dataraw[40500:40650]
'{"values":["<br /> <b>Riders by Lee Women\'s On the Go Performance Capri</b> <br /> <ul> <li>21&quot; Inseam</li> <li>Rib knit waist with button and '
Notice how the &quot;
entity is kept as-is.
Now you can decode the string as JSON:
>>> d = json.loads(dataraw)
>>> d.keys()
dict_keys(['uuid', 'isMobile', 'isBot', 'isAdsEnabled', 'isEsiEnabled', 'isInitialStateDeferred', 'isServiceWorkerEnabled', 'isShellRequest', 'productId', 'product', 'showTrustModal', 'productBasicInfo', 'fulfillmentOptions', 'feedback', 'backLink', 'offersOrder', 'sellersHeading', 'fdaCompliance', 'recommendationMap', 'header', 'footer', 'addToRegistry', 'addToList', 'ads', 'btvMap', 'postQuestion', 'autoPartFinder', 'getPromoStatus', 'discoveryModule', 'lastAction', 'isAjaxCall', 'accessModeEnabled', 'seller'])
>>>
replace_entities
was introduced in parsel v1.2.0. (See https://github.com/scrapy/parsel/pull/88)
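If you are stuck on an older parsel without that parameter, a workaround is to extract the raw script text first (e.g. with .extract_first()) and apply the regex yourself with the stdlib re module, so no entity replacement ever happens. A minimal sketch, using a made-up state string in place of the real response:

```python
import json
import re

# Hypothetical stand-in for response.xpath('//script/text()').extract_first()
script_text = 'window.__WML_REDUX_INITIAL_STATE__ = {"size": "21&quot; Inseam"};};'

# Same regex as above, applied directly -- entities are left untouched
m = re.search(r'__WML_REDUX_INITIAL_STATE__ =*(.*\});\};', script_text)
if m:
    state = json.loads(m.group(1))
    print(state)  # {'size': '21&quot; Inseam'}
```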