Why are some of the values of items' fields repeating in this Scrapy spider?
When my spider runs on the URL with this parse method:
def parse_subandtaxonomy(self, response):
    item = response.meta['item']
    for sub in response.xpath('//div[@class = "page-content"]/section'):
        item['Subcategory'] = sub.xpath('h2/text()').extract()
        for tax in sub.xpath('ul/li/a'):
            item['Taxonomy'] = tax.xpath('text()').extract()
            for href in tax.xpath('@href'):
                # url = response.urljoin(href.extract()) -> this gave me 301 redirects
                badurl = urljoin('https://211sepa.org/search/', href.extract())
                url = badurl.replace('search?', 'search/?area_served=Philadelphia&', 1)  # shut off to test multi-page
                request = scrapy.Request(url, callback=self.parse_listings)
                request.meta['item'] = item
                yield item
I get this output, which is exactly what I expect:
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Section 8 Vouchers"]}
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Public Housing"]}
{"Category": ["Housing"], "Subcategory": ["Affordable Housing"], "Taxonomy": ["Low Income/ Subsidized Rental Housing"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelter Centralized Intake"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Domestic Violence Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Runaway/ Youth Shelters"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Cold Weather Shelters/ Warming Centers"]}
{"Category": ["Housing"], "Subcategory": ["Shelter"], "Taxonomy": ["Homeless Shelter for Pregnant Women"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Rent Payment Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Mortgage Payment Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["Landlord/ Tenant Mediation"]}
{"Category": ["Housing"], "Subcategory": ["Stay Housed"], "Taxonomy": ["General Dispute Mediation"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Transitional Housing/ Shelter"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Rental Deposit Assistance"]}
{"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Permanent Supportive Housing"]}
But when I change yield item to yield request so the crawl continues, every item comes back as {"Category": ["Housing"], "Subcategory": ["Overcome Homelessness"], "Taxonomy": ["Permanent Supportive Housing"] ... other item info ... } instead of carrying its respective subcategory and taxonomy. Every item I ultimately want from each taxonomy does get scraped, but it is mislabeled as above. Any idea what is going on?
This is most likely a scoping issue. You should try to create your items in as local a scope as possible to prevent stale data being carried over: if the current item is never given a fresh Taxonomy value, the object keeps the data from the previous loop cycle, and because every yielded request holds a reference to that same object, they all end up seeing the values written in the last iteration. That is why the code should create a new object on every loop cycle where possible.
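To make that concrete, here is a minimal sketch in plain Python (not part of either spider and independent of Scrapy) showing why mutating and re-yielding one shared dict leaves every consumer with the last values written:

# A single shared dict: every stored reference points at the same object,
# so mutating it in the loop makes all entries show the final iteration's values.
item = {'Category': ['Housing']}
collected = []
for taxonomy in (['Section 8 Vouchers'], ['Public Housing'], ['Homeless Shelters']):
    item['Taxonomy'] = taxonomy   # mutates the one shared object
    collected.append(item)        # stores a reference, not a snapshot
print(collected)                  # all three entries show ['Homeless Shelters']

# Copying per iteration keeps each snapshot distinct, which is what the fix below does.
collected = []
for taxonomy in (['Section 8 Vouchers'], ['Public Housing'], ['Homeless Shelters']):
    snapshot = item.copy()
    snapshot['Taxonomy'] = taxonomy
    collected.append(snapshot)
print(collected)                  # each entry keeps its own Taxonomy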
Try this:
def parse_subandtaxonomy(self, response):
    for sub in response.xpath('//div[@class = "page-content"]/section'):
        subcategory = sub.xpath('h2/text()').extract()
        # or just take the first element, which is nicer:
        subcategory = sub.xpath('h2/text()').extract_first()
        for tax in sub.xpath('ul/li/a'):
            item = response.meta['item'].copy()  # fresh copy per iteration, so each request gets its own item
            item['Subcategory'] = subcategory
            item['Taxonomy'] = tax.xpath('text()').extract()
            for href in tax.xpath('@href'):
                # url = response.urljoin(href.extract()) -> this gave me 301 redirects
                badurl = urljoin('https://211sepa.org/search/', href.extract())
                url = badurl.replace('search?', 'search/?area_served=Philadelphia&', 1)  # shut off to test multi-page
                request = scrapy.Request(url,
                                         callback=self.parse_listings,
                                         meta={'item': item})  # you can put meta here directly
                yield request
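For completeness, a rough sketch of what the receiving callback could look like; the Listings field and the selector below are illustrative assumptions, not taken from the original spider:

def parse_listings(self, response):
    # Retrieve the per-taxonomy copy that was attached to the request.
    item = response.meta['item']
    # Hypothetical field name and selector; adjust to the actual listing markup.
    item['Listings'] = response.xpath('//div[@class="listing"]//h3/text()').extract()
    yield item

As a side note, on Scrapy 1.7 and later the Request cb_kwargs argument is the recommended way to pass data like this to a callback, although meta works fine as shown above.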