How to run multiple spiders through individual pipelines?

Complete newbie, just getting started with Scrapy.

My directory structure looks like this...

#FYI: running on Scrapy 2.4.1
WebScraper/
  Webscraper/
     spiders/
        spider.py    # (NOTE: contains spider1 and spider2 classes.)
     items.py
     middlewares.py
     pipelines.py    # (NOTE: contains spider1_pipelines and spider2_pipelines)
     settings.py     # (NOTE: I wrote here:
                     #ITEM_PIPELINES = {
                     #  'WebScraper.pipelines.spider1_pipelines': 300,
                     #  'WebScraper.pipelines.spider2_pipelines': 300,
                     #} 
  scrapy.cfg

And spider.py is something like...

import scrapy

class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["http://url1.com",]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["http://url2.com",]
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

and pipelines.py looks like...

import csv

class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])
    def process_item(self, item, spider):
        row = []
        row.append(item['header1'])
        row.append(item['header2'])
        self.csvwriter.writerow(row)
        return item

class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])
    def process_item(self, item, spider):
        row = []
        row.append(item['header_a']) #NOTE: this is not the same as header1
        row.append(item['header_b']) #NOTE: this is not the same as header2
        self.csvwriter.writerow(row)
        return item
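
(As a side note, opening the CSV file inside __init__ never closes it. Below is a minimal sketch, not my original code, of spider1_pipelines rewritten with Scrapy's open_spider/close_spider hooks; spider2_pipelines would follow the same pattern with its own file and headers.)

import csv

class spider1_pipelines(object):
    def open_spider(self, spider):
        # Open the output file when the spider starts and write the header row.
        self.file = open('spider1.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['header1'], item['header2']])
        return item  # pass the item on to any later pipeline stages

    def close_spider(self, spider):
        # Close the file once the spider finishes.
        self.file.close()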

I have a question about running spider1 and spider2 on different URLs with a single terminal command:

nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log

Note: this is an extension of a previous question (2018).

Desired result: spider1.csv gets data from spider1, spider2.csv gets data from spider2.

Current result: spider1.csv gets data from spider1; spider2.csv BREAKS, but the error log shows spider2's data along with a KeyError: 'header1', even though spider2's items don't contain header1 at all, only header_a.

Does anyone know how to run these spiders one after another on different URLs, and route the data fetched by spider1, spider2, etc. to spider-specific pipelines, i.e. spider1 -> spider1_pipelines -> spider1.csv, spider2 -> spider2_pipelines -> spider2.csv?

Or is this perhaps a matter of specifying spider1_item and spider2_item in items.py? I'm wondering whether I can specify where spider2's data gets inserted that way.

Thanks!

You can achieve this with the custom_settings spider attribute, which lets you override settings on a per-spider basis. (The KeyError happens because ITEM_PIPELINES in settings.py applies globally, so items from both spiders are passed through both pipelines.)

# spider.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    ...

class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    ...
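
With custom_settings in place, each spider's items only pass through its own pipeline, so the KeyError disappears and the same nohup command can be used unchanged (spider-level settings take priority over the project-level ITEM_PIPELINES in settings.py).

If you would rather launch both spiders from a single script instead of chaining crawl commands, Scrapy's CrawlerProcess can run them in one process. A minimal sketch, assuming the spider classes are importable from WebScraper.spiders.spider (adjust the import path to your project layout):

# run_spiders.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from WebScraper.spiders.spider import OneSpider, TwoSpider

process = CrawlerProcess(get_project_settings())
process.crawl(OneSpider)   # spider1 -> spider1_pipelines -> spider1.csv
process.crawl(TwoSpider)   # spider2 -> spider2_pipelines -> spider2.csv
process.start()            # blocks until both crawls have finished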