How to Serialize Scrapy Fields that are Lists of Items in XML Exporter
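I have built complex items whose fields can be lists of other item types. When I export them with the default XmlItemExporter, the sub-list items are wrapped in <value> tags. I am looking for an example of how to assign the sub-item identifier to those value tags.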
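The Item Exporters page of the documentation says this:
Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.
The documentation page also gives simple examples of declaring a serializer in the field and of overriding the serialize_field() method, but both are for single-valued fields, with no advice on how to customize them for multi-valued fields.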
I have searched the web for examples of how to do this and found none.
Here is the sample item tree I am using for testing:
import scrapy

class Course(scrapy.Item):
    title = scrapy.Field()
    lessons = scrapy.Field()

class Lesson(scrapy.Item):
    session = scrapy.Field()
    topic = scrapy.Field()
    assignment = scrapy.Field()

class ReadingAssignment(scrapy.Item):
    textBook = scrapy.Field()
    pages = scrapy.Field()

course = Course()
course['title'] = 'Greatness'
course['lessons'] = []
lesson = Lesson()
lesson['session'] = 'Week 1'
lesson['topic'] = 'Think Great'
lesson['assignment'] = []
reading = ReadingAssignment()
reading['textBook'] = 'Great Book 1'
reading['pages'] = '1-20'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)
lesson = Lesson()
lesson['session'] = 'Week 2'
lesson['topic'] = 'Act Great'
lesson['assignment'] = []
reading = ReadingAssignment()
reading['textBook'] = 'Great Book 2'
reading['pages'] = '21-40'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)
lesson = Lesson()
lesson['session'] = 'Week 3'
lesson['topic'] = 'Look Great'
lesson['assignment'] = []
reading = ReadingAssignment()
reading['textBook'] = 'Great Book 3'
reading['pages'] = '41-60'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)
lesson = Lesson()
lesson['session'] = 'Week 4'
lesson['topic'] = 'Be Great'
lesson['assignment'] = []
reading = ReadingAssignment()
reading['textBook'] = 'Great Book 4'
reading['pages'] = '61-80'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)
Output:
>>> course
{'lessons': [{'assignment': [{'pages': '1-20', 'textBook': 'Great Book 1'}],
'session': 'Week 1',
'topic': 'Think Great'},
{'assignment': [{'pages': '21-40', 'textBook': 'Great Book 2'}],
'session': 'Week 2',
'topic': 'Act Great'},
{'assignment': [{'pages': '41-60', 'textBook': 'Great Book 3'}],
'session': 'Week 3',
'topic': 'Look Great'},
{'assignment': [{'pages': '61-80', 'textBook': 'Great Book 4'}],
'session': 'Week 4',
'topic': 'Be Great'}],
'title': 'Greatness'}
When I run this through XmlItemExporter, I get:
<?xml version="1.0" encoding="utf-8"?>
<items>
<course>
<title>Greatness</title>
<lessons>
<value>
<session>Week 1</session>
<topic>Think Great</topic>
<assignment>
<value>
<textBook>Great Book 1</textBook>
<pages>1-20</pages>
</value>
</assignment>
</value>
<value>
<session>Week 2</session>
<topic>Act Great</topic>
<assignment>
<value>
<textBook>Great Book 2</textBook>
<pages>21-40</pages>
</value>
</assignment>
</value>
<value>
<session>Week 3</session>
<topic>Look Great</topic>
<assignment>
<value>
<textBook>Great Book 3</textBook>
<pages>41-60</pages>
</value>
</assignment>
</value>
<value>
<session>Week 4</session>
<topic>Be Great</topic>
<assignment>
<value>
<textBook>Great Book 4</textBook>
<pages>61-80</pages>
</value>
</assignment>
</value>
</lessons>
</course>
</items>
What I would like to do is change those <value> tags to the name of the item that was appended to the list. Like this:
<items>
<course>
<title>Greatness</title>
<lessons>
<lesson>
<session>Week 1</session>
<topic>Think Great</topic>
<assignment>
<reading>
<textBook>Great Book 1</textBook>
<pages>1-20</pages>
</reading>
</assignment>
</lesson>
<lesson>
<session>Week 2</session>
<topic>Act Great</topic>
<assignment>
<reading>
<textBook>Great Book 2</textBook>
<pages>21-40</pages>
</reading>
</assignment>
</lesson>
<lesson>
<session>Week 3</session>
<topic>Look Great</topic>
<assignment>
<reading>
<textBook>Great Book 3</textBook>
<pages>41-60</pages>
</reading>
</assignment>
</lesson>
<lesson>
<session>Week 4</session>
<topic>Be Great</topic>
<assignment>
<reading>
<textBook>Great Book 4</textBook>
<pages>61-80</pages>
</reading>
</assignment>
</lesson>
</lessons>
</course>
</items>
This is indeed not well documented, so we have to resort to reading the XmlItemExporter source code, where it turns out that the <value> tag choice has been hard-coded in the XmlItemExporter._export_xml_field() method:
elif is_listlike(serialized_value):
    self._beautify_newline()
    for value in serialized_value:
        self._export_xml_field('value', value, depth=depth+1)
    self._beautify_indent(depth=depth)
Fortunately, there is a way out, a few lines earlier:
if hasattr(serialized_value, 'items'):
    self._beautify_newline()
    for subname, value in serialized_value.items():
        self._export_xml_field(subname, value, depth=depth+1)
    self._beautify_indent(depth=depth)
This is meant to handle dictionaries, but it will in fact accept anything that has an .items() method returning tuples of a string and an item!
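To see why the order of those two branches matters, here is a minimal sketch that does not use Scrapy at all. The export_field() function is a simplified stand-in for the branch order in _export_xml_field(), and RepeatedTag is a hypothetical helper name; both are assumptions made for illustration only.

```python
import io
from xml.sax.saxutils import XMLGenerator


class RepeatedTag:
    """Hypothetical helper: has .items(), but repeats one tag for every value."""

    def __init__(self, tag, values):
        self._tag = tag
        self._values = values

    def items(self):
        for value in self._values:
            yield self._tag, value


def export_field(xg, name, value):
    """Simplified stand-in: the .items() check comes before the list check."""
    xg.startElement(name, {})
    if hasattr(value, "items"):
        # Anything dict-like chooses its own child tag names.
        for subname, subvalue in value.items():
            export_field(xg, subname, subvalue)
    elif isinstance(value, (list, tuple)):
        # Plain lists always get the hard-coded 'value' tag.
        for subvalue in value:
            export_field(xg, "value", subvalue)
    else:
        xg.characters(str(value))
    xg.endElement(name)


def render(name, value):
    out = io.StringIO()
    export_field(XMLGenerator(out, encoding="utf-8"), name, value)
    return out.getvalue()


plain = render("lessons", ["Week 1", "Week 2"])
named = render("lessons", RepeatedTag("lesson", ["Week 1", "Week 2"]))
print(plain)  # children are wrapped in <value>
print(named)  # children are wrapped in <lesson>
```

Because the .items() check runs first, wrapping a list in such an object is enough to get away from the hard-coded tag.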
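However, an important step is missing in the exporters: recursion. You can basically only set the serializer flag on top-level item fields; any Field() elements on Item subclasses beyond the top-level item are entirely ignored by the current Scrapy implementation. Each exporter has its own peculiarities in how it drives the internal BaseItemExporter._get_serialized_fields() method, so we can't handle recursion up front, because each specific exporter (JSON, XML, etc.) needs its fields serialized differently. We can work around this with a subclass of the XmlItemExporter class, see below.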
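So the first trick here is to create a dedicated object that has an .items() method and gives you your <container> tags. Note that you have to handle recursing over the serialization yourself! The Scrapy serializers do not handle recursion over nested structures by themselves: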
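class CustomXMLValuesSerializer:
    """Wraps a list so the XML exporter emits one <name> element per entry."""

    @classmethod
    def serialize_as(cls, name):
        # `serialize` is optional so the serializer also works when an
        # exporter calls it with the value only.
        def serializer(items, serialize=None):
            return cls(name, items, serialize)
        return serializer

    def __init__(self, name, items, serialize=None):
        self._name = name
        self._items = items
        # Fall back to the identity function when no callback is given.
        self._serialize = serialize if serialize is not None else lambda x: x

    def items(self):
        # Yield (tag, value) pairs, mimicking dict.items().
        for item in self._items:
            yield (self._name, self._serialize(item))
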
Then use the CustomXMLValuesSerializer.serialize_as() class method to create custom serializers for your list fields:
class Course(scrapy.Item):
    title = scrapy.Field()
    lessons = scrapy.Field(
        serializer=CustomXMLValuesSerializer.serialize_as("lesson")
    )

class Lesson(scrapy.Item):
    session = scrapy.Field()
    topic = scrapy.Field()
    assignment = scrapy.Field(
        serializer=CustomXMLValuesSerializer.serialize_as("reading")
    )

class ReadingAssignment(scrapy.Item):
    textBook = scrapy.Field()
    pages = scrapy.Field()

Finally, we need a slightly customized exporter, one that actually lets us recurse over nested items:
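import scrapy
from scrapy.exporters import XmlItemExporter

class RecursingXmlItemExporter(XmlItemExporter):
    def _recursive_serialized_fields(self, item):
        # Nested items go through the same field serialization as top-level ones.
        if isinstance(item, scrapy.Item):
            return dict(self._get_serialized_fields(item, default_value=''))
        return item

    def serialize_field(self, field, name, value):
        serializer = field.get('serializer', lambda x: x)
        try:
            # Two-argument serializers also receive the recursion callback.
            return serializer(value, self._recursive_serialized_fields)
        except TypeError:
            # Plain one-argument serializers keep working as before.
            return serializer(value)
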
Note that this passes in default_value='', because that's what the base XmlItemExporter.export_item() implementation uses.
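One caveat with the try/except TypeError fallback in serialize_field() is that a TypeError raised inside a two-argument serializer triggers the one-argument retry as well. If that bothers you, a possible variation is to inspect the serializer's signature instead. The sketch below is plain Python and only an illustration; call_serializer, shout and nested are hypothetical names, not part of Scrapy.

```python
import inspect


def call_serializer(serializer, value, recurse):
    """Pass `recurse` only to serializers that accept a second positional argument."""
    try:
        params = list(inspect.signature(serializer).parameters.values())
    except (TypeError, ValueError):
        # Some callables expose no signature; treat them as one-argument.
        return serializer(value)
    positional = [
        p for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                      inspect.Parameter.POSITIONAL_OR_KEYWORD)
    ]
    takes_varargs = any(
        p.kind is inspect.Parameter.VAR_POSITIONAL for p in params
    )
    if takes_varargs or len(positional) >= 2:
        return serializer(value, recurse)
    return serializer(value)


def shout(value):
    # One-argument serializer, as in the Scrapy documentation examples.
    return str(value).upper()


def nested(values, serialize):
    # Two-argument serializer that applies the callback to every entry.
    return [serialize(v) for v in values]


one_arg = call_serializer(shout, "week 1", recurse=len)
two_arg = call_serializer(nested, ["ab", "cde"], recurse=len)
print(one_arg)
print(two_arg)
```

Inside the exporter, the body of serialize_field() could then delegate to such a helper with self._recursive_serialized_fields as the callback.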
Make sure to use this custom exporter, as it passes in the context needed to serialize nested items:
# some_file must be a file object opened in binary mode, since exporters write bytes.
exporter = RecursingXmlItemExporter(some_file, indent=2, item_element='course')
exporter.start_exporting()
exporter.export_item(course)
exporter.finish_exporting()
Now the list entries are actually exported using the name string as their element name:
<?xml version="1.0" encoding="utf-8"?>
<items>
<course>
<title>Greatness</title>
<lessons>
<lesson>
<session>Week 1</session>
<topic>Think Great</topic>
<assignment>
<reading>
<textBook>Great Book 1</textBook>
<pages>1-20</pages>
</reading>
</assignment>
</lesson>
<lesson>
<session>Week 2</session>
<topic>Act Great</topic>
<assignment>
<reading>
<textBook>Great Book 2</textBook>
<pages>21-40</pages>
</reading>
</assignment>
</lesson>
<lesson>
<session>Week 3</session>
<topic>Look Great</topic>
<assignment>
<reading>
<textBook>Great Book 3</textBook>
<pages>41-60</pages>
</reading>
</assignment>
</lesson>
<lesson>
<session>Week 4</session>
<topic>Be Great</topic>
<assignment>
<reading>
<textBook>Great Book 4</textBook>
<pages>61-80</pages>
</reading>
</assignment>
</lesson>
</lessons>
</course>
</items>
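If you want to check the result programmatically, one option is to parse the exported document and confirm that no <value> elements remain. This is a minimal sketch using the standard library's xml.etree.ElementTree; the exported variable is an abbreviated copy of the output above (one lesson only) standing in for the contents of the real file.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the exported document; in practice, read the file instead.
exported = b"""<?xml version="1.0" encoding="utf-8"?>
<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <lesson>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </reading>
        </assignment>
      </lesson>
    </lessons>
  </course>
</items>
"""

root = ET.fromstring(exported)

# Any remaining generic <value> wrappers, at any depth.
leftover = [el.tag for el in root.iter("value")]

# Tag names of the direct children of each list field.
lesson_tags = [child.tag for child in root.find("course/lessons")]
reading_tags = [child.tag for child in root.find("course/lessons/lesson/assignment")]

print(leftover, lesson_tags, reading_tags)
```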
I filed issue #3888 with the Scrapy project to see whether the project is interested in supporting nested Item structures better.
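Another option would be to export nested items by calling the XmlItemExporter.export_item() method separately, but this requires either that the exporter is accessible as a global in the same namespace as the serializer, or that you subclass the exporter and... pass the exporter to the serializer. And then you have to contend with the fact that XmlItemExporter.export_item() hard-codes the indentation.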