Scrapy: How to store scraped data in different json files within one crawler run?

I am using a generic spider with a list of multiple URLs in the start_urls field.

Is it possible to export one json file for each URL?

As far as I know, it is only possible to set one path for one specific output file.

Any ideas on how to solve this would be appreciated!

Edit: This is my spider class:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['https://www.domain1.com', 'https://www.domain2.com',
                  'https://www.domain3.com']


    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'DEPTH_LIMIT': '1',
        'FEED_URI': 'file:///C:/path/to/result.json',
    }

    rules = (
        Rule(LinkExtractor(allow=r"abc"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        all_text = response.xpath("//p/text()").getall()

        yield {
            "text": " ".join(all_text),
            "url": response.url,
        }

First option

You can save the items in the spider itself, as in the Scrapy tutorial, for example:

import scrapy
import json

DICT = {
    'https://quotes.toscrape.com/page/1/': 'domain1.json',
    'https://quotes.toscrape.com/page/2/': 'domain2.json',
}


class MydomainSpider(scrapy.Spider):
    name = "mydomain"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Map each start URL to its own output file
        filename = DICT[response.url]
        with open(filename, 'w', encoding='utf-8') as fp:
            json.dump({"content": response.body.decode("utf-8")}, fp)

The DICT variable is just there to specify the JSON filenames, but you could also use the domain as the filename.
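
For instance, here is a minimal sketch (with a hypothetical spider name) that derives the filename from the response's domain instead of a lookup table; it assumes one page per domain, since several pages on the same domain would overwrite each other's file:

import json
from urllib.parse import urlparse

import scrapy


class DomainFileSpider(scrapy.Spider):
    # Hypothetical variant: the filename is derived from the domain
    # instead of being looked up in DICT.
    name = "domainfile"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # e.g. 'quotes.toscrape.com.json'. Note that response.url can
        # differ from the start URL if the request was redirected.
        domain = urlparse(response.url).netloc
        with open(f'{domain}.json', 'w', encoding='utf-8') as fp:
            json.dump({"content": response.body.decode("utf-8")}, fp)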

Second option

You can try using process_item in pipelines.py, like this:

from scrapy.exporters import JsonItemExporter


class SaveJsonPipeline:
    def process_item(self, item, spider):
        # Pop the filename off the item so it does not end up in the output
        filename = item['filename']
        del item['filename']
        with open(filename, 'wb') as fp:
            exporter = JsonItemExporter(fp, encoding='utf-8')
            exporter.start_exporting()
            exporter.export_item(item)
            exporter.finish_exporting()
        return item

item['filename'] holds the output filename for each start URL. You also need to set up items.py, for example:

import scrapy


class MydomainItem(scrapy.Item):
    filename = scrapy.Field()
    content = scrapy.Field()

Your spider:

import scrapy
from ..items import MydomainItem


DICT = {
    'https://quotes.toscrape.com/page/1/': 'domain1.json',
    'https://quotes.toscrape.com/page/2/': 'domain2.json',
}


class MydomainSpider(scrapy.Spider):
    name = 'mydomain'
    allowed_domains = ['toscrape.com']
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        item = MydomainItem()
        # Attach the target filename so the pipeline knows where to write
        item["filename"] = DICT[response.url]
        item["content"] = response.body.decode("utf-8")
        yield item

Before running, you need to add the pipeline to your settings:

ITEM_PIPELINES = {
    'myproject.pipelines.SaveJsonPipeline': 300,
}
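
After adding the pipeline, run the spider as usual (scrapy crawl mydomain); each start URL's item is then written to its own file, domain1.json and domain2.json in this example.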