scrapy itemloaders return list of items
    def parse(self, response):
        for link in LinkExtractor(restrict_xpaths="BLAH").extract_links(response)[:-1]:
            yield Request(link.url)
        l = MytemsLoader()
        l.add_value('main1', some xpath)
        l.add_value('main2', some xpath)
        l.add_value('main3', some xpath)
        rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]")
        for row in rows:
            # the same loader is reused for every row
            l.add_value('table1', some xpath based on rows)
            l.add_value('table2', some xpath based on rows)
            l.add_value('main3', some xpath based on rows)
            yield l.load_item()
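I'm using an item loader because I want to preprocess the fields and handle any null values easily.
Each table row should be its own entity, carrying the main1, main2, main3, ... fields plus its own fields.
However, the code above keeps overwriting the single l item loader and only returns the last row for each main page.
Question:
How can I use an item loader to combine the main-page data with each table-row entry? If I use two item loaders, one for each part, how do I combine them?
For future reference: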
    def newparse(self, response):
        for link in LinkExtractor(restrict_xpaths="BLAH").extract_links(response)[:-1]:
            yield Request(link.url)
        # load the page-level fields once
        ml = MyitemLoader()
        ml.add_value('main1', some xpath)
        ml.add_value('main2', some xpath)
        ml.add_value('main3', some xpath)
        main_item = ml.load_item()
        rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]")
        for row in rows:
            # a new loader per row, seeded with the page-level item
            bl = MyitemLoader(item=main_item, selector=row)
            bl.add_value('table1', some xpath based on row)
            bl.add_value('table2', some xpath based on row)
            bl.add_value('main3', some xpath based on row)
            yield bl.load_item()
You need to instantiate a new ItemLoader inside the loop, passing it the item argument:
    l = MytemsLoader()
    l.add_value('main1', some xpath)
    l.add_value('main2', some xpath)
    l.add_value('main3', some xpath)
    item = l.load_item()
    rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]")
    for row in rows:
        # new loader for each row, starting from the page-level item
        l = MytemsLoader(item=item)
        l.add_value('table1', some xpath based on rows)
        l.add_value('table2', some xpath based on rows)
        l.add_value('main3', some xpath based on rows)
        yield l.load_item()
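
To make the difference between one shared loader and one loader per row concrete, here is a minimal, library-free sketch. TinyLoader is a hypothetical stand-in, not Scrapy's ItemLoader, and PAGE and ROWS are made-up data. The deep copy of the seed item is my own assumption: as far as I know, a Scrapy loader created with item= populates that same object, so it may be worth passing a copy for each row if the yielded items must stay independent.

```python
import copy


class TinyLoader:
    """Hypothetical stand-in for an item loader (not Scrapy's class)."""

    def __init__(self, item=None):
        # Deep-copy the seed so loaders never share one underlying item.
        self._item = copy.deepcopy(item) if item is not None else {}

    def add_value(self, field, value):
        # Skip empty values, mimicking the "handle nulls" goal of a loader.
        if value not in (None, ""):
            self._item[field] = value

    def load_item(self):
        return self._item


PAGE = {"main1": "Acme", "main2": "2024"}  # made-up page-level data
ROWS = [{"table1": "a", "table2": 1}, {"table1": "b", "table2": 2}]


def shared_loader(page, rows):
    """One loader for the whole page: every yield is the same object."""
    loader = TinyLoader()
    for field, value in page.items():
        loader.add_value(field, value)
    for row in rows:
        for field, value in row.items():
            loader.add_value(field, value)
        yield loader.load_item()


def loader_per_row(page, rows):
    """Load the page fields once, then seed a fresh loader for each row."""
    main = TinyLoader()
    for field, value in page.items():
        main.add_value(field, value)
    main_item = main.load_item()
    for row in rows:
        loader = TinyLoader(item=main_item)
        for field, value in row.items():
            loader.add_value(field, value)
        yield loader.load_item()


broken = list(shared_loader(PAGE, ROWS))
fixed = list(loader_per_row(PAGE, ROWS))
print([item["table1"] for item in broken])  # every entry shows the last row
print([item["table1"] for item in fixed])   # one entry per row
```

With the shared loader, every collected item is the same object and ends up holding the last row, which matches the symptom in the question. With a loader per row, each item keeps the page-level fields plus its own row fields.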