Pipeline to remove None values
My spider yields certain data, but sometimes that data cannot be found. Instead of writing conditions like:
if response.xpath('//div[@id="mitten"]//h1/text()').extract_first():
    result['name'] = response.xpath('//div[@id="mitten"]//h1/text()').extract_first()
I would rather solve this in my pipeline by removing all keys that have a None value. I tried to do that with the following code:
class BasicPipeline(object):
    """ Basic pipeline for scrapers """

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        item = dict((k,v) for k,v in item.iteritems() if v is not None)
        item['date'] = datetime.date.today().strftime("%d-%m-%y")
        for key, value in item.iteritems():
            if isinstance(value, basestring):
                item[key] = value.strip() # strip every value of the item
        # If an address is a list, convert it to a string
        if "address" in item:
            if isinstance(item['address'], list): # check if address is a list
                item['address'] = u", ".join(line.strip() for line in item['address'] if len(line.strip()) > 0)
        # Determine the currency of the price if possible
        if "price" in item:
            if u'€' in item['price'] or 'EUR' in item['price']:
                item['currency'] = 'EUR'
            elif u'$' in result['price'] or 'USD' in item['price']:
                item['currency'] = 'USD'
        # Extract e-mails from text
        if "email" in item:
            if isinstance(item['email'], list): # check if email is a list
                item['email'] = u" ".join(line.strip() for line in item['email']) # convert to a string
            regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
            item['email'] = u";".join(line.strip() for line in re.findall(regex, item['email']))
            if "mailto:" in item['email']:
                item['email'] = item.replace("mailto:","")
        if "phone" in item or "email" in item:
            return item
        else:
            DropItem("No contact details: %s" % item)
However, this leads to the following error:
2018-03-05 10:11:03 [scrapy] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x103c14dd0>>
Traceback (most recent call last):
File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 57, in robustApply
return receiver(*arguments, **named)
File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 193, in item_scraped
slot.exporter.export_item(item)
File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/exporters.py", line 184, in export_item
self._write_headers_and_set_fields_to_export(item)
File "/Users/casper/Documents/crawling/env/lib/python2.7/site-packages/scrapy/exporters.py", line 199, in _write_headers_and_set_fields_to_export
self.fields_to_export = list(item.fields.keys())
AttributeError: 'NoneType' object has no attribute 'fields'
I assume this has to do with the fact that an item is fed into the pipeline but is not returned at the end, but that is just a guess.
Currently the pipeline contains statements such as:
if "website" in item:
    # Do stuff
and I would like to avoid adding unnecessary extra statements just to check whether a value is None.
Your current code would probably work if you returned the item you create:
def process_item(self, item, spider):
    item = dict((k,v) for k,v in item.iteritems() if v is not None)
    return item
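As a quick pure-Python illustration of what that comprehension does (in Python 3, `iteritems()` is spelled `items()`; the toy item below is made up for the example):

```python
# A toy item with a missing (None) field, as a spider might yield it
item = {"name": "Mitten", "address": None, "phone": "555-0100"}

# Keep only the keys whose value is not None
cleaned = {k: v for k, v in item.items() if v is not None}
print(cleaned)  # {'name': 'Mitten', 'phone': '555-0100'}
```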
That said, I strongly recommend using item loaders in your scrapy spiders.
Not creating fields for empty data is just one of their many benefits.
Edit:
Now that you have included the full pipeline code, I can see that the error is in its last line.
Your code creates an exception object, discards it, and then returns None; the DropItem exception must be raised:
raise DropItem("No contact details: %s" % item)
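To see why the bare `DropItem(...)` call silently falls through, here is a minimal pure-Python sketch. The stand-in exception class is for illustration only, since `scrapy.exceptions.DropItem` behaves like any ordinary exception:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem (illustration only)."""

def process_without_raise(item):
    if "phone" in item or "email" in item:
        return item
    else:
        DropItem("No contact details: %s" % item)  # object built, then discarded
    # falls off the end of the function, implicitly returning None

def process_with_raise(item):
    if "phone" in item or "email" in item:
        return item
    raise DropItem("No contact details: %s" % item)

# Without raise, the caller receives None -- exactly the None that
# later crashes the feed exporter in the traceback above.
print(process_without_raise({"name": "x"}))  # None
```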