Items vs Item Loaders in Scrapy

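I'm new to Scrapy. I know Items are used to hold the scraped data, but I can't understand the difference between Items and Item Loaders. I tried reading some example code, and it uses Item Loaders instead of Items to store the data, and I don't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when Item Loaders are used and what additional facilities they provide over Items?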

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

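The last paragraph should answer your question.
Item Loaders are great because they give you many processing shortcuts and let you reuse a lot of code, which keeps everything tidy, clean and easy to understand.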

Here is a comparison. Say we want to scrape this item:

from scrapy import Item, Field

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()

The Items-only approach would look something like this:

def parse(self, response):
    item = MyItem()
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # e.g. returns an ugly list like ['John\n', '\n\t  ', '  Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    # extract_first(0) falls back to 0 when the node is missing
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item
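
To make the cleanup step concrete, here is a small sketch of what those inline expressions do, run on the sample list from the comment above. The `raw_age` string is a made-up value, used only for illustration.

```python
# Sample list taken from the comment in the parse() method above
raw_name = ['John\n', '\n\t  ', '  Snow']

# Strip every fragment, skip the ones that are empty after stripping,
# and join what is left with single spaces
full_name = ' '.join(part.strip() for part in raw_name if part.strip())
print(full_name)  # John Snow

# Hypothetical text content of the age <div>; int() ignores
# surrounding whitespace, so no explicit strip() is needed here
raw_age = ' 27\n'
age = int(raw_age)
print(age)  # 27
```

Every text field repeats the first expression and every numeric field repeats the second, which is exactly the kind of duplication that is easy to get wrong by copy and paste.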

Compare that with the Item Loaders approach:

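# define once in items.py
# (on older Scrapy versions, import these from scrapy.loader.processors)
from itemloaders.processors import Compose, MapCompose, Join, TakeFirst
from scrapy.loader import ItemLoader

# strip each fragment, drop the ones left empty, then join with spaces
clean_text = Compose(MapCompose(lambda v: v.strip() or None), Join())
# take the first extracted value and convert it to int
to_int = Compose(TakeFirst(), int)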

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int

# reuse the loader in as many callbacks as you like
def parse(self, response):
    loader = MyItemLoader(selector=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()
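
If it helps to see how the `*_out` processors chain together, below is a dependency-free sketch. The functions `map_compose`, `join`, `take_first` and `compose` are simplified stand-ins I wrote for illustration, not the real processor classes, which also handle loader context and other options. The `or None` in `clean_text` is my assumption for dropping blank fragments, so the result matches the Items-only version.

```python
# Simplified, hypothetical stand-ins for the Item Loader processors.
# They only mimic the basic data flow of the real classes.

def map_compose(*funcs):
    """Apply each function to every value; drop values that become None."""
    def process(values):
        for func in funcs:
            results = (func(v) for v in values)
            values = [r for r in results if r is not None]
        return values
    return process

def join(separator=" "):
    """Join a list of strings into a single string."""
    def process(values):
        return separator.join(values)
    return process

def take_first():
    """Return the first value that is neither None nor an empty string."""
    def process(values):
        for value in values:
            if value is not None and value != "":
                return value
        return None
    return process

def compose(*funcs):
    """Pass the whole value through each function in turn; stop on None."""
    def process(value):
        for func in funcs:
            if value is None:
                return None
            value = func(value)
        return value
    return process

clean_text = compose(map_compose(lambda v: v.strip() or None), join())
to_int = compose(take_first(), int)

print(clean_text(['John\n', '\n\t  ', '  Snow']))  # John Snow
print(to_int(['27', '28']))  # 27
```

The point is that each field's cleanup is declared once as a small pipeline, and the loader runs it on whatever list of values `add_xpath` collected for that field.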

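As you can see, the Item Loader version is much cleaner and easier to scale. Say you have 20 or more fields, many of which share the same processing logic: doing that without Item Loaders would be a nightmare. Item Loaders are great, and you should use them!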