Pipeline for item not JSON serializable
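I am trying to write the output of my scraped XML to JSON. The scrape fails because an item is not serializable.
An existing question suggests that you need to build a pipeline, but no answer is given there because it was out of scope for that question: SO scrapy serializer
So I turned to the scrapy docs.
They show an example, but then advise against using it: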
The purpose of JsonWriterPipeline is just to introduce how to write
item pipelines. If you really want to store all scraped items into a
JSON file you should use the Feed exports.
If I go to the feed exports page, it shows this:
JSON
FEED_FORMAT: json Exporter used: JsonItemExporter See this warning if
you’re using JSON with large feeds.
My issue still remains because, as far as I understand, feed exports are run from the command line like this:
scrapy runspider myxml.py -o ~/items.json -t json
However, this produces the very error I was intending to solve with a pipeline:
TypeError: <bound method SelectorList.extract of [<Selector xpath='.//@venue' data=u'Royal Randwick'>]> is not JSON serializable
How do I create a JSON pipeline to fix the JSON serialization error?
This is my code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
#
import json


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []

        for site in sites:
            item = ConvXmlItem()
            item['venue'] = site.xpath('.//@venue').extract
            item['name'] = site.xpath('.//race/@id').extract()
            item['url'] = site.xpath('.//race/@number').extract()
            item['description'] = site.xpath('.//race/@distance').extract()
            items.append(item)

        return items


# class JsonWriterPipeline(object):
#
#     def __init__(self):
#         self.file = open('items.jl', 'wb')
#
#     def process_item(self, item, spider):
#         line = json.dumps(dict(item)) + "\n"
#         self.file.write(line)
#         return item
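The problem is here:
item['venue'] = site.xpath('.//@venue').extract
You simply forgot to call extract. Without the parentheses, item['venue'] holds the bound method itself rather than the list of extracted strings, which is exactly what the TypeError reports. No pipeline is needed; replace the line with: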
item['venue'] = site.xpath('.//@venue').extract()
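To see the difference in isolation, here is a minimal sketch that reproduces the error without scrapy. FakeSelectorList is a hypothetical stand-in written only for this illustration; it is not scrapy's SelectorList and only imitates the extract() method.

```python
import json


class FakeSelectorList(list):
    """Hypothetical stand-in for scrapy's SelectorList (illustration only)."""

    def extract(self):
        # Return the plain strings held by this list.
        return [str(value) for value in self]


venues = FakeSelectorList(["Royal Randwick"])

# Without parentheses the dict stores the bound method, which json cannot encode.
try:
    json.dumps({"venue": venues.extract})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# With parentheses the dict stores a list of strings, which json can encode.
encoded = json.dumps({"venue": venues.extract()})

print(error_message)
print(encoded)
```

The first call fails with a TypeError saying the value is not JSON serializable, and the second succeeds. The same should apply to the feed export started with -o, since it also has to encode every item field as JSON.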