Pipeline for item not JSON serializable

Question

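I am trying to write the output of scraped XML to JSON. The scrape fails because an item is not serializable.

This question suggests that I need to build a pipeline, but no answer was given because it was out of scope for that question: SO scrapy serializer

So I looked at the Scrapy docs. They illustrate an example pipeline, but then advise against using it: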

The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

If I go to the feed exports docs, they show this:

JSON

FEED_FORMAT: json
Exporter used: JsonItemExporter
See this warning if you’re using JSON with large feeds.

My problem still remains, because as far as I understand this is meant to be run from the command line:

scrapy runspider myxml.py -o ~/items.json -t json

However, this produces the very error I was hoping to solve with a pipeline.

TypeError: <bound method SelectorList.extract of [<Selector xpath='.//@venue' data=u'Royal Randwick'>]> is not JSON serializable

How do I create a JSON pipeline that fixes the JSON serialization error?

This is my code.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
# 
import json


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []

        for site in sites:
            item = ConvXmlItem()
            item['venue'] = site.xpath('.//@venue').extract
            item['name'] = site.xpath('.//race/@id').extract()
            item['url'] = site.xpath('.//race/@number').extract()
            item['description'] = site.xpath('.//race/@distance').extract()
            items.append(item)

        return items


        # class JsonWriterPipeline(object):
        #
        #     def __init__(self):
        #         self.file = open('items.jl', 'wb')
        #
        #     def process_item(self, item, spider):
        #         line = json.dumps(dict(item)) + "\n"
        #         self.file.write(line)
        #         return item
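For reference, here is a minimal sketch of how the commented-out JsonWriterPipeline above might look as a standalone, module-level class under Python 3. It assumes text-mode file handling and the optional `open_spider`/`close_spider` hooks that Scrapy item pipelines can define; the `path` parameter is a hypothetical convenience and the file name `items.jl` is taken from the snippet above. Note that a pipeline like this only changes where items are written: any value that `json.dumps` cannot handle would still raise the same TypeError.

```python
import json


class JsonWriterPipeline:
    """Write each scraped item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="items.jl"):
        # `path` is a hypothetical convenience parameter; the default is
        # used whenever the pipeline is created without arguments.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Text mode with an explicit encoding, because json.dumps returns str.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Close the file so that buffered lines are flushed to disk.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
```

To try it in a project, the class would typically live in the project's pipelines module and be enabled through the `ITEM_PIPELINES` setting; the exact module path depends on the project layout and is not shown in the question.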

Answer 1

The problem is here:

item['venue'] = site.xpath('.//@venue').extract

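You simply forgot to call extract. Without the parentheses, the item stores the bound method itself rather than the list of extracted strings, which is exactly what the TypeError reports. Replace it with: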

item['venue'] = site.xpath('.//@venue').extract()
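The difference can be reproduced without Scrapy. The sketch below uses `FakeSelectorList`, a hypothetical stand-in invented for illustration (it is not Scrapy's real `SelectorList`), to show that `json.dumps` rejects a bound method but accepts the list returned by calling it.

```python
import json


class FakeSelectorList(list):
    """Hypothetical stand-in for Scrapy's SelectorList, for illustration only."""

    def extract(self):
        # Mimics the real method by returning the matches as a list of strings.
        return [str(x) for x in self]


matches = FakeSelectorList(["Royal Randwick"])

# Without parentheses: the value is the bound method itself.
not_called = matches.extract
try:
    json.dumps({"venue": not_called})
    error = None
except TypeError as exc:
    # json cannot serialize a method object, so a TypeError is raised.
    error = exc

# With parentheses: the value is a plain list of strings.
called = matches.extract()
serialized = json.dumps({"venue": called})

print(type(not_called).__name__, "->", error)
print(serialized)
```

With the call in place, the feed export command from the question should no longer hit this error, so a custom pipeline is not needed just to work around it.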

python

serialization

json

scrapy

scrapy-pipeline