将嵌套 JSON 结构反序列化为 Django 模型对象

Deserialize nested JSON structures to Django model objects

我需要使用发送 JSON 包含 JSON 序列化嵌套结构的响应的服务,我想将其反序列化并存储在我的数据库中 - 我的应用程序使用 Django。

业务规则如下:

  1. 查询 returns 对象总是有一个 id 属性 这是一个唯一的整数,通常是 createdAt 属性和一个 updatedAt 属性,都带有日期时间数据,然后是其他几个基本类型的属性(int、float、str、datetime 等),以及几个可以是另一个对象或一个对象的属性对象数组。

  2. 如果 属性 值是一个对象,则父级通过 'foreign key' 与其关联。如果它是一个对象数组,那么我们有两种情况:数组的对象通过 'foreign key' 与父对象相关,或者父对象和数组的每个成员通过 'many-to-many' 相关关系.

  3. 我需要在我的数据库中镜像每个对象,所以每个模型都有一个 id 字段作为主键,但它不是自动生成的,因为真实的 ID 将是随导入数据一起提供。

  4. 所有这些实体之间的关系已经反映在我的模型架构中。 我采用了这种方法(镜像数据结构),因为如果我将接收到的数据展平以将其全部保存到一个table,将会出现可怕的复制,违反所有数据规范化规则。

  5. 对于每个根对象,我需要这样做:

    • 检查数据库中是否已有该 id
    • 的记录
    • 创建一个新记录以防万一
    • 更新现有记录,以防已有记录(如果 updatedAt 记录和传入数据的值相同,则可能会跳过更新
    • 为每个嵌套对象递归地重复这些相同的步骤,每个嵌套对象是其父属性之一的提供值。

下面我重现了一个非常简化的示例,其中包含我从服务收到的数据以及我要存储它的模型。真正的东西比那庞大和复杂得多,这就是为什么我如此想学习一种让 ORM 解决问题的方法,如果它能够的话到。硬编码整个事情需要永远,除了很容易出错,如果将来数据模式发生变化,还会造成维护困难。

编辑:A link 到 previous simplified version of the following JSON and Models*

JSON样本:

{
  "id": 37125965,
  "number": "029073432019403",
  "idCommunication": "1843768",
  "docReceivedAt": {
    "date": "2019-12-20 08:46:42"
  },
  "createdAt": {
    "date": "2019-12-20 09:01:14"
  },
  "updatedAt": {
    "date": "2019-12-20 09:01:32"
  },
  "branch": {
    "id": 20,
    "name": "REGIONAL OFFICE #3",
    "address": "457 Beau St., S\u00e3o Paulo, SP, 08547-003",
    "active": true,
    "createdAt": {
      "date": "2013-02-14 23:12:30"
    },
    "updatedAt": {
      "date": "2019-05-09 13:40:47"
    }
  },
  "modality": {
    "id": 1,
    "valor": "CITA\u00c7\u00c3O",
    "descricao": "CITA\u00c7\u00c3O",
    "active": true,
    "createdAt": {
      "date": "2014-08-29 20:47:56"
    },
    "updatedAt": {
      "date": "2014-08-29 20:47:56"
    }
  },
  "operation": {
    "id": 12397740,
    "number": "029073432019403",
    "startedAt": {
      "date": "2019-11-07 22:28:25"
    },
    "managementType": 27,
    "assessmentValue": 5000000,
    "createdAt": {
      "date": "2019-12-20 09:01:30"
    },
    "updatedAt": {
      "date": "2019-12-20 09:01:30"
    },
    "operationClass": {
      "id": 22,
      "name": "A\u00c7\u00c3O RESCIS\u00d3RIA",
      "createdAt": {
        "date": "2014-02-28 20:24:55"
      },
      "updatedAt": {
        "date": "2014-02-28 20:24:55"
      }
    },
    "evaluator": {
      "id": 26798,
      "name": "JANE DOE",
      "level": 1,
      "active": true,
      "createdAt": {
        "date": "2017-02-22 22:54:04"
      },
      "updatedAt": {
        "date": "2017-03-15 18:03:20"
      },
      "evaluatorsOffice": {
        "id": 7,
        "name": "ACME",
        "area": 4,
        "active": true,
        "createdAt": {
          "date": "2014-02-28 20:25:16"
        },
        "updatedAt": {
          "date": "2014-02-28 20:25:16"
        }
      },
      "evaluatorsOffice_id": 7
    },
    "operationClass_id": 22,
    "evaluator_id": 26798
  },
  "folder": {
    "id": 16901241,
    "singleDocument": false,
    "state": 0,
    "IFN": "00409504174201972",
    "closed": false,
    "dataHoraAbertura": {
      "date": "2019-12-20 09:01:31"
    },
    "dataHoraTransicao": {
      "date": "2024-12-20 09:01:31"
    },
    "titulo": "CONTROL FOLDER REF. OP. N. 029073432019403",
    "createdAt": {
      "date": "2019-12-20 09:01:32"
    },
    "updatedAt": {
      "date": "2019-12-20 09:01:32"
    },
    "subjects": [
      {
        "id": 22255645,
        "main": true,
        "createdAt": {
          "date": "2019-12-20 09:01:32"
        },
        "updatedAt": {
          "date": "2019-12-20 09:01:32"
        },
        "subjectClass": {
          "id": 20872,
          "name": "SPECIAL RETIREMENT PROCESS",
          "active": true,
          "regulation": "8.213/91, 53.831/64, 83.080/79, 2.172/97, 1.663/98, 9.711/98, 9.528/97 AND 9.032/95",
          "glossary": "SPECIAL RETIREMENT APPLICATION DUE TO HAZARDOUS LABOR CONDITION FOR 15+/20+/25+ YEARS",
          "createdAt": {
            "date": "2013-10-18 16:22:44"
          },
          "updatedAt": {
            "date": "2013-10-18 16:22:44"
          },
          "parent": {
            "id": 20866,
            "name": "RETIREMENT BENEFITS",
            "active": true,
            "createdAt": {
              "date": "2013-10-18 16:22:44"
            },
            "updatedAt": {
              "date": "2013-10-18 16:22:44"
            },
            "parent": {
              "id": 20126,
              "name": "SOCIAL SECURITY",
              "active": true,
              "createdAt": {
                "date": "2013-10-18 16:22:42"
              },
              "updatedAt": {
                "date": "2013-10-18 16:22:42"
              }
            },
            "parent_id": 20126
          },
          "parent_id": 20866
        },
        "subjectClass_id": 20872
      }
    ],
    "person": {
      "id": 7318,
      "isClient": true,
      "isRelated": false,
      "name": "SOCSEC CO.",
      "createdAt": {
        "date": "2013-02-14 23:11:43"
      },
      "updatedAt": {
        "date": "2019-11-18 16:05:07"
      }
    },
    "operation": {
      "id": 12397740,
      "number": "029073432019403",
      "startedAt": {
        "date": "2019-11-07 22:28:25"
      },
      "managementType": 27,
      "assessmentValue": 5000000,
      "createdAt": {
        "date": "2019-12-20 09:01:30"
      },
      "updatedAt": {
        "date": "2019-12-20 09:01:30"
      }
    },
    "section": {
      "id": 311,
      "name": "PROTOCOL",
      "address": "457 Beau St., ground floor, S\u00e3o Paulo, SP, 08547-003",
      "active": true,
      "management": false,
      "onlyDistribution": true,
      "createdAt": {
        "date": "2013-02-14 23:12:31"
      },
      "updatedAt": {
        "date": "2019-07-05 16:40:34"
      },
      "branch": {
        "id": 20,
        "name": "REGIONAL OFFICE #3",
        "address": "457 Beau St., S\u00e3o Paulo, SP, 08547-003",
        "active": true,
        "createdAt": {
          "date": "2013-02-14 23:12:30"
        },
        "updatedAt": {
          "date": "2019-05-09 13:40:47"
        }
      },
      "branch_id": 20
    },
    "person_id": 7318,
    "operation_id": 12397740,
    "section_id": 311
  },
  "branch_id": 20,
  "modality_id": 1,
  "operation_id": 12397740,
  "folder_id": 16901241
}

Models.py样本:

from django.db import models


class Section(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    address = models.CharField(max_length=255, null=True)
    active = models.BooleanField(default=True)
    management = models.BooleanField(default=False)
    onlyDistribution = models.BooleanField(default=False)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    branch = models.ForeignKey('Branch', null=True, on_delete=models.SET_NULL)


class Person(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    isClient = models.BooleanField(default=True)
    isRelated = models.BooleanField(default=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()


class SubjectClass(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    active = models.BooleanField(default=True)
    regulation = models.CharField(max_length=255, null=True)
    glossary = models.CharField(max_length=255, null=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    parent = models.ForeignKey('SubjectClass', null=True, on_delete=models.SET_NULL)


class Subject(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    main = models.BooleanField(default=False)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    folder = models.ForeignKey('Folder', null=True, on_delete=models.SET_NULL)
    subjectClass = models.ForeignKey(SubjectClass, null=True, on_delete=models.SET_NULL)


class Folder(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    singleDocument = models.BooleanField(default=False)
    state = models.PositiveSmallIntegerField(null=True)
    IFN = models.CharField(max_length=31, null=True)
    closed = models.BooleanField(default=False)
    title = models.CharField(max_length=255, null=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    subjects = models.ManyToManyField(SubjectClass, through=Subject, through_fields=('folder', 'subjectClass'))
    interestedEntity = models.ForeignKey(Person, null=True, on_delete=models.SET_NULL)


class EvaluatorsOffice(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    area = models.PositiveSmallIntegerField(null=True)
    active = models.BooleanField(default=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()


class Evaluator(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    level = models.PositiveSmallIntegerField(null=True)
    active = models.BooleanField(default=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    evaluatorsOffice = models.ForeignKey(EvaluatorsOffice, null=True, on_delete=models.SET_NULL)


class OperationClass(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    active = models.BooleanField(default=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()


class Operation(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    number = models.CharField(max_length=31, null=True)
    startedAt = models.DateTimeField(null=True)
    managementType = models.PositiveIntegerField(null=True)
    assessmentValue = models.PositiveIntegerField(null=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    operationClass = models.ForeignKey(OperationClass, null=True, on_delete=models.SET_NULL)
    evaluator = models.ForeignKey(Evaluator, null=True, on_delete=models.SET_NULL)


class Branch(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    name = models.CharField(max_length=255, null=True)
    address = models.CharField(max_length=255, null=True)
    active = models.BooleanField(default=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()


class Modality(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    value = models.CharField(max_length=255, null=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()


class CommunicationRecord(models.Model):
    id = models.PositiveIntegerField(primary_key=True)
    number = models.CharField(max_length=31, null=True)
    idCommunication = models.CharField(max_length=31, null=True)
    docReceivedAt = models.DateTimeField(null=True)
    createdAt = models.DateTimeField()
    updatedAt = models.DateTimeField()
    branch = models.ForeignKey(Branch, null=True, on_delete=models.SET_NULL)
    modality = models.ForeignKey(Modality, null=True, on_delete=models.SET_NULL)
    operation = models.ForeignKey(Operation, null=True, on_delete=models.SET_NULL)
    folder = models.ForeignKey(Folder, null=True, on_delete=models.SET_NULL)

编辑(参考 DRF Serializers):

我正在尝试遵循 Max Malysh I Reinstate Monica 的建议,并开始研究递归序列化程序:

from django.db.models import Manager, Model, Field, DateTimeField, ForeignKey
from rest_framework.serializers import ModelSerializer


class RecursiveSerializer(ModelSerializer):
    manager: Manager
    field_dict: dict

    def __init__(self, target_manager: Manager, data: dict, **kwargs):
        self.manager = target_manager
        self.Meta.model = self.manager.model
        self.field_dict = {f.name: f for f in self.manager.model._meta.fields}
        instance = None
        data = self.process_data(data)
        pk_name = self.manager.model._meta.pk.name
        if pk_name in data:
            try:
                instance = target_manager.get(pk=data[pk_name])
            except target_manager.model.DoesNotExist:
                pass
        super().__init__(instance, data, **kwargs)

    def process_data(self, data: dict):
        processed_data = {}
        for name, value in data.items():
            field: Field = self.field_dict.get(name)
            if isinstance(value, dict):
                if isinstance(field, ForeignKey):
                    processed_data[name] = self.__class__(field.related_model.objects, data=value)
                    continue
                elif len(value) == 1 and 'date' in value and isinstance(field, DateTimeField):
                    processed_data[name] = value['date']
                    continue
            processed_data[name] = value
        return processed_data

    class Meta:
        model: Model = None
        fields = '__all__'

但是,它做了一件奇怪的事情:当第一个 运行,针对一个空数据库时,它只创建最后一个也是嵌套最深的对象。在第二个 运行 中,它什么都不做,并且 returns 一个 code='unique' 验证错误,表明这样的对象已经存在。

现在我必须说我对 Python 和 Django(我来自 .NET 开发)还很陌生,我在这个任务中面临的困难开始让我觉得很尴尬。我一直在阅读有关 Django 和 DRF 的文档,它们对我的帮助低于我的预期。然而,我拒绝相信上述语言和框架缺乏执行如此微不足道的操作的资源。所以,如果我缺少一些非常明显的东西,看起来,因为缺乏我的知识,如果有人教我一些我在这里似乎不知道的东西,我将不胜感激。

好的,所以我放弃了使用 DRF,只是为我的其他模型创建了一个扩展的抽象模型来继承,并赋予它我需要的功能,实现如下所示。

from datetime import datetime
from django.db import models
from django.db.models import FieldDoesNotExist
from django.utils import timezone


class UpdateableModel(models.Model):
    class Meta:
        abstract = True

    @classmethod
    def creupdate(cls, data: dict, save_main_instance: bool = True):
        primary_key_name = cls._meta.pk.name
        if primary_key_name not in data:
            raise ValueError(f'parameter \'data\' must contain \'{primary_key_name}\' key (model\'s primary key).')
        try:
            instance = cls.objects.get(pk=data[primary_key_name])
            at_instance = getattr(instance, 'atualizadoEm', None)
            at_data = data.get('atualizadoEm', None)
            if at_instance and at_data and at_instance >= at_data:
                print(f'unchanged instance {primary_key_name} {instance.pk} from {instance._meta.model}')
                return instance
            operation = 'updated'
        except cls.DoesNotExist:
            instance = cls()
            operation = 'created'
        many_to_many_instances = []
        for name, value in data.items():
            if isinstance(value, dict):
                if len(value) == 1 and 'date' in value:
                    date_value = datetime.strptime(value['date'], '%Y-%m-%d %H:%M:%S')
                    if timezone.is_naive(date_value):
                        date_value = timezone.make_aware(date_value)
                    setattr(instance, name, date_value)
                else:
                    foreign_key = cls._meta.get_field(name)
                    foreign_model = foreign_key.related_model
                    foreign_data: dict = value
                    foreign_instance = foreign_model.creupdate(foreign_data)
                    setattr(instance, name, foreign_instance)
            elif isinstance(value, list):
                try:
                    relation_field = cls._meta.get_field(name)
                except FieldDoesNotExist:
                    relation_field = None
                if relation_field:
                    for through_data in value:
                        try:
                            through_model = getattr(instance, name).through
                            if isinstance(through_data,dict) and through_model._meta.pk.name in through_data:
                                many_to_many_instances.append(through_model.creupdate(through_data, False))
                        except ValueError:
                            pass
            else:
                setattr(instance, name, value)
        if save_main_instance:
            instance.save()
        print(f'{operation} instance {primary_key_name} {instance.pk} from {instance._meta.model}')
        for many_to_many_instance in many_to_many_instances:
            many_to_many_instance.save()
        return instance

现在,虽然它有效(我刚刚使用它导入了大量数据),但我现在不会将其标记为答案,原因有二:

  1. 我愿意听到对我的实现的批评,这将指出缺陷和使其更健壮和优化的方法。

  2. 我仍然希望有比我更好的解决方案。只有在几个月过去但什么也没有出现的情况下,我才会假设有 none 并接受我自己的答案。

一般来说,为什么我同意 DRF 对这种情况没有用:DRF 定义了一个 API 并且在许多方面类似于视图而不是模型:它定义了应该导出数据的哪一部分。它能够支持对同一数据结构的所有 CRUD 操作。同一数据可能有更多 APIs。因此,序列化器与模型分离是正常的。如果第三方包应该成为新 API 的一部分,也经常需要不更改模型中的任何内容。您只需要创建和更新(不需要读取或删除)。您确认不需要任何复杂的安全限制。

EDIT 我的代码的主要特征为您更新的 JSON 和模型将与您的更相似。那没有意义。我会写更多的评论和更少的更改代码,因为它可能导致模型的无限增长和 JSON 来解释为什么你,例如。忽略一些错误。

您的作业的重要信息是:1. "through" 实体在所有多对多关系中的数据都出现在 JSON 中(以前从未出现过) 2. changedAt 根实体的时间戳随 JSON 中嵌套实体的每次更改而更新,包括所有中间实体,甚至包括 "through" 实体。

from datetime import datetime
from django.db import models
from django.utils import timezone


class UpdateableModel(models.Model):
    class Meta:
        abstract = True

    @classmethod
    def creupdate(cls, data: dict, save_main_instance: bool = True, no_optimization=False):
        primary_key_name = cls._meta.pk.name
        if primary_key_name not in data:
            raise ValueError(f'parameter \'data\' must contain \'{primary_key_name}\' key (model\'s primary key).')
        try:
            instance = cls.objects.get(pk=data[primary_key_name])
            at_instance = getattr(instance, 'atualizadoEm', None)
            at_data = data.get('atualizadoEm', None)
            operation = 'unchanged' if at_instance and at_data and at_instance >= at_data else 'updated'
            if operation == 'unchanged' and not no_optimizations:
                print(f'{operation} instance {primary_key_name} {instance.pk} from {instance._meta.model}')
                return instance
        except cls.DoesNotExist:
            instance = cls()
            operation = 'created'
        many_to_many_instances = []
        for name, value in data.items():
            if isinstance(value, dict):
                if len(value) == 1 and 'date' in value:
                    date_value = datetime.strptime(value['date'], '%Y-%m-%d %H:%M:%S')
                    if timezone.is_naive(date_value):
                        date_value = timezone.make_aware(date_value)
                    new_value = date_value
                else:
                    foreign_key = cls._meta.get_field(name)
                    foreign_model = foreign_key.related_model
                    foreign_data: dict = value
                    foreign_instance = foreign_model.creupdate(foreign_data)
                    new_value = foreign_instance
            elif isinstance(value, list):
                remote_field = getattr(instance, name)
                obj_ids = []
                for remote_data in value:
                    assert isinstance(remote_data, dict) and remote_field.model._meta.pk.name in remote_data
                    obj_ids.append(remote_field.model.creupdate(remote_data, False).pk)
                many_to_many_instances.append((remote_field, obj_ids))
            else:
                new_value = value
            if operation != 'unchanged':
                setattr(instance, name, new_value)
        if save_main_instance and operation != 'unchanged':
            instance.save()
        print(f'{operation} instance {primary_key_name} {instance.pk} from {instance._meta.model}')
        for remote_field, obj_ids in many_to_many_instances:
            remote_field.add(*obj_ids)
        return instance

备注:

  • 多对多关系经过优化,可以通过一个请求添加所有对象,从而在没有任何更改的情况下最大限度地减少保存次数。 (它是为之前的 JSON 结构编写的,没有任何明确的 "through" 数据)

  • 添加了 assert 而不是 try ... 除了 ValueError: pass(或 FieldDoesNotExist)。
    "Errors should never pass silently." Zen of Python - 特别是在开发中。 (未知 through 名称与未知普通属性的错误类似。)

  • 添加了一个参数 "no_optimization" 并让我的逻辑仅对同一实体使用 "modifiedAt",而不是跳过对相关实体的检查。如果发生错误,或者如果 FieldDoesNotExist 错误地忽略了更新,则稍后可以通过使用 no_optimization=True 重放数据来更新数据库的状态。如果所有实体都使用时间戳,那么它甚至是 idempotent 并且可以以任何随机顺序处理数据,例如通过重复一段时间内有一些错误的数据。通过使用和不使用优化处理获得相同的数据库状态对于检查优化也很有用 - 例如通过比较导出的 sql 转储。我的经验是,如果没有准备好替代方法,过分依赖时间戳的优化会在很久以后成为问题。

根据文档,这就是方法。

class UserSerializer(serializers.Serializer):
    email = serializers.EmailField()
    username = serializers.CharField(max_length=100)

class CommentSerializer(serializers.Serializer):
    user = UserSerializer()
    content = serializers.CharField(max_length=200)
    created = serializers.DateTimeField()

From Django Rest Framework: nested serializers