Pass file_name argument to pipeline for csv export in scrapy
I need Scrapy to take an argument from the command line (-a FILE_NAME="stuff") and apply it to the file created in my CsvWriterPipeline in pipelines.py. (The reason I went with pipelines.py is that the built-in exporter was duplicating data and repeating the header in the output file; the same code written as a pipeline fixed it.)
I tried importing get_project_settings from scrapy.utils.project as shown in
How to access scrapy settings from item Pipeline
but I couldn't change the file name from the command line.
I also tried implementing @avaleske's solution on that page, since it addresses this exact problem, but I don't know where in my Scrapy folder to put the code he talks about.
Help?
settings.py:
BOT_NAME = 'internal_links'
SPIDER_MODULES = ['internal_links.spiders']
NEWSPIDER_MODULE = 'internal_links.spiders'
CLOSESPIDER_PAGECOUNT = 100
ITEM_PIPELINES = ['internal_links.pipelines.CsvWriterPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'internal_links (+http://www.mycompany.com)'
FILE_NAME = "mytestfilename"
pipelines.py:
import csv

class CsvWriterPipeline(object):

    def __init__(self, file_name):
        header = ["URL"]
        self.file_name = file_name
        self.csvwriter = csv.writer(open(self.file_name, 'wb'))
        self.csvwriter.writerow(header)

    def process_item(self, item, internallinkspider):
        # build your row to export, then export the row
        row = [item['url']]
        self.csvwriter.writerow(row)
        return item
spider.py:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from internal_links.items import MyItem

class MySpider(CrawlSpider):
    name = 'internallinkspider'
    allowed_domains = ['angieslist.com']
    start_urls = ['http://www.angieslist.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item
You can use the "settings" concept together with the -s command-line argument:
scrapy crawl internallinkspider -s FILE_NAME="stuff"
Then, in the pipeline:
import csv

class CsvWriterPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        file_name = settings.get("FILE_NAME")
        return cls(file_name)

    def __init__(self, file_name):
        header = ["URL"]
        self.csvwriter = csv.writer(open(file_name, 'wb'))
        self.csvwriter.writerow(header)

    def process_item(self, item, internallinkspider):
        # build your row to export, then export the row
        row = [item['url']]
        self.csvwriter.writerow(row)
        return item
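If you would rather keep the -a FILE_NAME="stuff" spider argument from the question, note that arguments passed with -a become attributes on the spider instance, so the pipeline can read them once the spider has opened. A minimal sketch, assuming the attribute is named FILE_NAME and falling back to a hypothetical default of output.csv (not part of the original code):

import csv

class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # -a FILE_NAME="stuff" shows up as spider.FILE_NAME (assumed attribute name)
        file_name = getattr(spider, 'FILE_NAME', 'output.csv')
        self.file = open(file_name, 'wb')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(["URL"])

    def close_spider(self, spider):
        # close the file when the crawl finishes
        self.file.close()

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['url']])
        return item

With this variant you would run it as:
scrapy crawl internallinkspider -a FILE_NAME="stuff"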