Why are some of the values of items' fields repeating in this Scrapy spider?

When my spider runs this on a URL:

def parse_subandtaxonomy(self, response):
    item = response.meta['item']
    for sub in response.xpath('//div[@class = "page-content"]/section'):
        item['Subcategory'] = sub.xpath('h2/text()').extract()
        for tax in sub.xpath('ul/li/a'):
            item['Taxonomy'] = tax.xpath('text()').extract()
            for href in tax.xpath('@href'):
                # url = response.urljoin(href.extract()) -> this gave me 301 redirects
                badurl = urljoin('https://211sepa.org/search/', href.extract())
                url = badurl.replace('search?', 'search/?area_served=Philadelphia&', 1) # shut off to test multi-page
                request = scrapy.Request(url, callback=self.parse_listings)
                request.meta['item'] = item
                yield item

I get this output, which is exactly what I expect:

{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Section 8 Vouchers"]}
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Public Housing"]}
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Low Income/ Subsidized Rental Housing"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelter Centralized Intake"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Domestic Violence Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Runaway/ Youth Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Cold Weather Shelters/ Warming Centers"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelter for Pregnant Women"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Rent Payment Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Mortgage Payment Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Landlord/ Tenant Mediation"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["General Dispute Mediation"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Transitional Housing/ Shelter"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Rental Deposit Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Permanent Supportive Housing"]}

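But when I change yield item to yield request to continue crawling, every item has {"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Permanent Supportive Housing"] ... other item info ... } instead of its own subcategory and taxonomy. Every item I ultimately want from each taxonomy is being scraped, but it is mislabeled as described above. Any idea what's going on?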

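This is probably a scope issue. Your code takes one item from response.meta and reuses that same object for every request. Scrapy only schedules the requests you yield; parse_listings runs later, after the loops have finished, so every callback sees the values written by the last iteration. You should create your item in the narrowest scope possible to prevent this kind of data retention: if an object is reused, it keeps whatever the previous loop cycle left in it (for example, a stale Taxonomy field when the current element has none). That's why the code should create a new object in every loop cycle where possible.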
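The effect can be reproduced without Scrapy. The following is a minimal sketch, not the spider itself: plain dicts stand in for items and requests, and the helper names (build_requests_shared, build_requests_copied, run_callbacks) are hypothetical. It assumes only that callbacks run after the loop that created the requests has finished.

```python
def build_requests_shared(base_item, taxonomies):
    """Mimic the buggy spider: one dict is mutated and reused for every request."""
    item = base_item
    pending = []
    for taxonomy in taxonomies:
        item["Taxonomy"] = [taxonomy]
        # Every entry stores a reference to the same dict object.
        pending.append({"meta": {"item": item}})
    return pending


def build_requests_copied(base_item, taxonomies):
    """Mimic the fix: each request gets its own copy of the item."""
    pending = []
    for taxonomy in taxonomies:
        item = base_item.copy()  # new object per loop cycle
        item["Taxonomy"] = [taxonomy]
        pending.append({"meta": {"item": item}})
    return pending


def run_callbacks(pending):
    """Read each request's item only after all requests have been created."""
    return [request["meta"]["item"]["Taxonomy"] for request in pending]


taxonomies = ["Section 8 Vouchers", "Public Housing"]
shared = run_callbacks(build_requests_shared({"Category": ["Housing"]}, taxonomies))
copied = run_callbacks(build_requests_copied({"Category": ["Housing"]}, taxonomies))

print(shared)  # [['Public Housing'], ['Public Housing']]
print(copied)  # [['Section 8 Vouchers'], ['Public Housing']]
```

In the shared version every pending request ends up labeled with the last taxonomy, which matches the mislabeling described in the question.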

Try this:

def parse_subandtaxonomy(self, response):
    # Assumes scrapy and urljoin are imported at module level, as in the question
    for sub in response.xpath('//div[@class = "page-content"]/section'):
        # extract_first() returns just the first match as a string, which is nicer than a list
        subcategory = sub.xpath('h2/text()').extract_first()
        for tax in sub.xpath('ul/li/a'):
            # copy the incoming item so every request carries its own object
            item = response.meta['item'].copy()
            item['Subcategory'] = subcategory
            item['Taxonomy'] = tax.xpath('text()').extract()
            for href in tax.xpath('@href'):
                # url = response.urljoin(href.extract()) -> this gave me 301 redirects
                badurl = urljoin('https://211sepa.org/search/', href.extract())
                url = badurl.replace('search?', 'search/?area_served=Philadelphia&', 1) # shut off to test multi-page
                request = scrapy.Request(url,
                                         callback=self.parse_listings,
                                         meta={'item': item})  # you can put meta here directly
                yield request