Convert a CSV into JSON using a JSON Schema

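How can I convert a flat table into JSON?

I have previously used custom code and libraries to convert JSON documents into flat tables. What I want to do here is the reverse. Before I go ahead and write a custom library, I would like to know whether anyone has run into this problem before and whether a ready-made solution exists.

When you flatten JSON into a CSV, you lose the structural information, so to reverse the process you need a document that describes how the JSON should be built, ideally a standardized JSON Schema.

The following example shows the source CSV, the JSON Schema, and the expected output.

User CSV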

user_id, adress.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter

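JSON Schema

This follows the defined standard and adds a "source" attribute. I propose adding this custom attribute for this specific problem, to map between CSV columns and JSON values (leaves).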

{
 "$schema": "https://json-schema.org/draft/2020-12/schema",
 "title": "User",
 "type": "object",
 "properties":{
  "user_id" : {"type":"integer", "source":"user_id"},
  "address":{
   "type":"object",
   "properties":{
    "city" : {"type":"string", "source":"address.city"},
    "street" : {"type":"string", "source":"address.street"},
    "number": { "type":"integer", "source":"address.number"}
   }
  },
  "name" : {"type":"string", "source":"name"},
  "aka":{
   "type": "array",
   "items" : {"type":"string", "source":"aka"}
  },
  "contacts":{
   "type":"array",
   "items":{
    "type":"object",
    "properties":{
     "name" : {"type":"string", "source":"contacts.name"},
     "relationship":{"type":"string", "source":"contacts.relationship"}
     }
   }
  }
 }
}
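
To make the role of the custom "source" attribute concrete, here is a minimal sketch that walks a schema and lists which CSV column feeds each JSON leaf. The helper name leaf_sources and the "[]" marker for array items are illustrative choices made up here, not part of JSON Schema, and the schema literal is a shortened version of the one above with the column names spelled address.*.

```python
def leaf_sources(schema, path=""):
    """Yield (json_path, csv_column) for every leaf that carries a "source"."""
    if "source" in schema:
        yield path, schema["source"]
    elif schema.get("type") == "object":
        for name, sub in schema.get("properties", {}).items():
            child = f"{path}.{name}" if path else name
            yield from leaf_sources(sub, child)
    elif schema.get("type") == "array":
        # "[]" stands for "each element of this array" (notation invented here)
        yield from leaf_sources(schema["items"], path + "[]")


# Shortened, hypothetical version of the User schema
schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer", "source": "user_id"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string", "source": "address.city"}},
        },
        "aka": {"type": "array", "items": {"type": "string", "source": "aka"}},
        "contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "source": "contacts.name"},
                },
            },
        },
    },
}

mapping = dict(leaf_sources(schema))
for json_path, column in mapping.items():
    print(f"{json_path:<16} <- {column}")
```

A mapping like this is enough to tell a converter which columns are plain leaves and which ones sit inside an array and therefore need to be collected across rows.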

Expected JSON

{
 "user_id":1,
 "address":{
  "city":"Seattle",
  "street":"Atomic Street",
  "number":6910
 },
 "name":"Rick Sanchez",
 "aka":[
  "Rick",
  "Grandpa",
  "Albert Ein-douche",
  "Richard"
 ],
 "contacts":[
  {
   "name":"Morty",
   "relationship":"Grandson"
  },
  {
   "name":"Beth",
   "relationship":"Daughter"
  }
 ]
}

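From the above we can see that although the CSV has 8 rows, we produce a single JSON object (rather than 8), because there is only one unique user (user_id = 1). This can be inferred from the JSON Schema, where the root element is an object rather than a list.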
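
One way this inference could work is sketched below, as a rough illustration rather than an existing library. It assumes every row passed in belongs to the same root object (a single user), that every leaf carries the custom "source" attribute, and that array items are the distinct combinations of their source columns. The names build, columns and CASTS are hypothetical, and the schema and CSV are cut down from the ones above.

```python
from csv import DictReader
from io import StringIO

# Casts for the leaf types used in this sketch (assumption: no other types)
CASTS = {"integer": int, "number": float, "string": str}


def columns(schema):
    """Return every CSV column referenced at or below this schema node."""
    if "source" in schema:
        return [schema["source"]]
    if schema["type"] == "object":
        return [c for sub in schema["properties"].values() for c in columns(sub)]
    if schema["type"] == "array":
        return columns(schema["items"])
    return []


def build(schema, rows):
    """Build the value described by `schema` from a non-empty list of rows."""
    kind = schema["type"]
    if kind == "object":
        return {name: build(sub, rows) for name, sub in schema["properties"].items()}
    if kind == "array":
        # One array element per distinct combination of the item's columns,
        # in first-seen order
        cols = columns(schema["items"])
        groups = {}
        for row in rows:
            groups.setdefault(tuple(row[c] for c in cols), []).append(row)
        return [build(schema["items"], group) for group in groups.values()]
    # Leaf: the value is the same on every row of the group, so take the first
    return CASTS[kind](rows[0][schema["source"]])


schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer", "source": "user_id"},
        "aka": {"type": "array", "items": {"type": "string", "source": "aka"}},
        "contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "source": "contacts.name"},
                    "relationship": {"type": "string", "source": "contacts.relationship"},
                },
            },
        },
    },
}

csv_text = """\
user_id,aka,contacts.name,contacts.relationship
1,Rick,Morty,Grandson
1,Grandpa,Morty,Grandson
1,Rick,Beth,Daughter
1,Grandpa,Beth,Daughter
"""

rows = list(DictReader(StringIO(csv_text)))
user = build(schema, rows)
print(user)
```

With several users in the same file, the rows would first have to be grouped by user before calling build on each group.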

If we did not specify a JSON Schema or some other kind of mapping, you could simply assume there is no structure and create 8 flat JSON objects, like this:

[
 {"user_id":1,"address.city":"Seattle", ... "aka":"Rick" ... "contacts.relationship":"Grandson"}
 ...
 {"user_id":1,"address.city":"Seattle", ... "aka":"Richard" ... "contacts.relationship":"Daughter"}
]

I am adding the Python tag because it is the language I use most, but the solution does not need to be in Python.

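I'm not entirely clear on why a JSON Schema is needed, but if you like, you can easily write a helper function that essentially "unflattens" the flat JSON that your CSV data maps to into the nested dictionary format mentioned above.

The example below is a simplified demonstration of how this works. Note the following two points:

  • In the CSV header, I corrected a typo and renamed one of the columns to address.city; it was previously adress.city, which would map it to a different JSON path under a separate adress key, which is probably not what you want.

  • I'm not sure of the best way to handle this, but it looks like the csv module only allows single-character delimiters; your CSV file appears to use a comma followed by a space (", ") as the delimiter, so I simply replaced every occurrence of it with a single comma (",") so that splitting on the delimiter works as expected.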
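
As a possible alternative to the string replacement in the second point, the csv module has a skipinitialspace option that ignores whitespace immediately after a delimiter. A small sketch, using a made-up two-line sample rather than the full data:

```python
from csv import DictReader
from io import StringIO

# Hypothetical sample with the same ", " separator as the question's CSV
sample = StringIO("user_id, address.city, name\n1, Seattle, Rick Sanchez\n")

# skipinitialspace=True drops the blank that follows each comma, so neither
# the header names nor the values keep a leading space
reader = DictReader(sample, skipinitialspace=True)
rows = list(reader)
print(rows)
```

Spaces inside a value, such as the one in "Rick Sanchez", are left alone; only the whitespace directly after the delimiter is skipped.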

from csv import DictReader
from io import StringIO
from typing import Any


csv_data = StringIO("""\
user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
""".replace(', ', ',')
)


def unflatten_json(json_dict: dict):
    """Unflatten a JSON dictionary object, with keys like 'a.b.c'"""
    result_dict = {}

    for k, v in json_dict.items():
        *nested_parts, field_name = k.split('.')

        obj = result_dict
        for p in nested_parts:
            obj = obj.setdefault(p, {})

        obj[field_name] = v

    return result_dict


def main():
    reader = DictReader(csv_data)
    flat_json: list[dict[str, Any]] = list(reader)

    first_obj = flat_json[0]
    nested_dict = unflatten_json(first_obj)

    print('Flat JSON:   ', first_obj)
    print('Nested JSON: ', nested_dict)


if __name__ == '__main__':
    main()

The output is as follows:

Flat JSON:    {'user_id': '1', 'address.city': 'Seattle', 'address.street': 'Atomic Street', 'address.number': '6910', 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts.name': 'Morty', 'contacts.relationship': 'Grandson'}
Nested JSON:  {'user_id': '1', 'address': {'city': 'Seattle', 'street': 'Atomic Street', 'number': '6910'}, 'name': 'Rick Sanchez', 'aka': 'Rick', 'contacts': {'name': 'Morty', 'relationship': 'Grandson'}}

Note that if you want to unflatten all of the JSON dictionary objects in the list, you can use a list comprehension instead, like this:

result_list = [unflatten_json(d) for d in flat_json]

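I would also point out that the solution above isn't perfect, since it passes everything through as string values, as in the case of 'user_id': '1'. To fix this, you can modify the unflatten_json function as follows: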

...
for k, v in json_dict.items():
    ...

    try:
        v = int(v)
    except ValueError:
        pass

    obj[field_name] = v

The unflattened JSON object should now look like the following. Note that I printed it with json.dumps(nested_dict, indent=2) so that it is easier to read.

{
  "user_id": 1,
  "address": {
    "city": "Seattle",
    "street": "Atomic Street",
    "number": 6910
  },
  "name": "Rick Sanchez",
  "aka": "Rick",
  "contacts": {
    "name": "Morty",
    "relationship": "Grandson"
  }
}
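
The int-only conversion could be extended to other scalar types. Below is a hedged sketch of a hypothetical coerce helper (not part of the code above) that tries int first, then float, and otherwise leaves the string unchanged:

```python
def coerce(value: str):
    """Return `value` as an int or float when it parses as one, else unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


print(coerce("6910"))     # int
print(coerce("3.5"))      # float
print(coerce("Seattle"))  # stays a string
```

Guessing types from the text has limits: identifiers with leading zeros lose them when cast to int, and strings such as "nan" or "inf" also parse as floats, so a per-column type declaration (for example the "type" in a schema) is the safer source of truth.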

Complete solution

A complete solution that produces the desired structure (with the data from all rows appended to aka and contacts) is provided below:

from csv import DictReader
from io import StringIO
from pprint import pprint


csv_data = StringIO("""\
user_id, address.city, address.street, address.number, name, aka, contacts.name, contacts.relationship
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Morty, Grandson
1, Seattle, Atomic Street, 6910, Rick Sanchez, Rick, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Grandpa, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Albert Ein-douche, Beth, Daughter
1, Seattle, Atomic Street, 6910, Rick Sanchez, Richard, Beth, Daughter
""".replace(', ', ',')
)


def unflatten_json(json_dict: dict[str, str]):
    """Unflatten a JSON dictionary object, with keys like 'a.b.c'"""
    result_dict = {}

    for k, v in json_dict.items():
        *nested_parts, field_name = k.split('.')

        obj = result_dict
        for p in nested_parts:
            obj = obj.setdefault(p, {})

        obj[field_name] = int(v) if v.isnumeric() else v

    return result_dict


def main():
    reader = DictReader(csv_data)

    rows = list(map(unflatten_json, reader))

    # retrieve the first element in the (unflattened) sequence
    result_obj = rows[0]
    # define list fields that we want to merge data for
    list_fields = ('aka', 'contacts')
    # now loop through, and for all rows merge the data for these fields
    for field in list_fields:
        result_obj[field] = [row[field] for row in rows]

    print('Result object:')
    pprint(result_obj)


if __name__ == '__main__':
    main()

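This gives the nested structure described in the question, although aka and contacts receive one entry per CSV row, so their values are repeated: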

Result object:
{'address': {'city': 'Seattle', 'number': 6910, 'street': 'Atomic Street'},
 'aka': ['Rick',
         'Grandpa',
         'Albert Ein-douche',
         'Richard',
         'Rick',
         'Grandpa',
         'Albert Ein-douche',
         'Richard'],
 'contacts': [{'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Morty', 'relationship': 'Grandson'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'},
              {'name': 'Beth', 'relationship': 'Daughter'}],
 'name': 'Rick Sanchez',
 'user_id': 1}
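
The merged lists repeat values because every CSV row contributes one entry. One possible way to drop the repeats while keeping first-seen order is sketched below. The dedupe helper is hypothetical, and it relies on json.dumps with sort_keys=True as a hashable fingerprint, since dictionaries cannot be put in a set directly.

```python
import json


def dedupe(items):
    """Drop repeated items, keeping the first occurrence of each."""
    seen = {}
    for item in items:
        # dicts are unhashable, so a canonical JSON string serves as the key
        seen.setdefault(json.dumps(item, sort_keys=True), item)
    return list(seen.values())


# Shortened sample of the merged lists
aka = ['Rick', 'Grandpa', 'Rick', 'Grandpa']
contacts = [
    {'name': 'Morty', 'relationship': 'Grandson'},
    {'name': 'Morty', 'relationship': 'Grandson'},
    {'name': 'Beth', 'relationship': 'Daughter'},
]

unique_aka = dedupe(aka)
unique_contacts = dedupe(contacts)
print(unique_aka)
print(unique_contacts)
```

In the main function above, this would amount to wrapping the list comprehension, as in result_obj[field] = dedupe(row[field] for row in rows).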

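As already mentioned, a JSON Schema does not help you transform the data. It can, however, help you validate the result.
To get one entry per user, I think you should group your DataFrame by ["user_id", "address.city", "address.street", "address.number", "name"]. These values should be constant for a given user.
Then aggregate the remaining columns to create lists.

I wrote generic functions to unflatten the dictionaries and to merge the lists into dictionaries. You could drop the recursion, since in your case everything happens at the top level: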

import json
import pandas as pd

df = pd.read_csv("file.csv", sep=", ", engine="python")
df = df.groupby(["user_id", "address.city", "address.street", "address.number", "name"], as_index=False).agg(lambda x: list(x))

#print(df) # uncomment to see the transformation

json_data = df.to_dict(orient="records")

def unflatten_dic(dic):
    for k,v in list(dic.items()):
        subkeys = k.split('.')
        if len(subkeys) > 1:
            dic.setdefault(subkeys[0],dict())
            dic[subkeys[0]].update({".".join(subkeys[1:]): v})
            unflatten_dic(dic[subkeys[0]])
            del(dic[k])


def merge_lists(dic):
    for k,v in list(dic.items()):
        if isinstance(v, dict):
            keys = list(v.keys())
            vals = list(v.values())
            if all(isinstance(l, list) and len(l)==len(vals[0]) for l in vals):
                dic[k] = []
                val_tuple = set(zip(*vals)) # removing duplicates with set()
                for t in val_tuple:
                    dic[k].append({subkey: t[i] for i, subkey in enumerate(keys)})
            else:
                merge_lists(v)
        elif isinstance(v, list):
            dic[k] = list(set(v))   # removing list duplicates
                    

for user in json_data:
    unflatten_dic(user)
    merge_lists(user)

print(json.dumps(json_data, indent=4))

Output:

[
    {
        "user_id": 1,
        "name": "Rick Sanchez",
        "aka": [
            "Richard",
            "Grandpa",
            "Albert Ein-douche",
            "Rick"
        ],
        "address": {
            "city": "Seattle",
            "street": "Atomic Street",
            "number": 6910
        },
        "contacts": [
            {
                "name": "Morty",
                "relationship": "Grandson"
            },
            {
                "name": "Beth",
                "relationship": "Daughter"
            }
        ]
    }
]
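
For comparison, the group-then-aggregate idea can also be sketched with only the standard library, which may help when pandas is not available. The group_users name, its id_columns and list_columns parameters, and the two-user sample are assumptions made for illustration; values stay strings, and dotted keys such as address.city would still need to be unflattened afterwards.

```python
from csv import DictReader
from io import StringIO


def group_users(rows, id_columns, list_columns):
    """Collapse rows sharing the same id_columns into one dict per user."""
    users = {}
    for row in rows:
        key = tuple(row[c] for c in id_columns)
        if key not in users:
            # First row for this user: copy the constant columns and
            # start an empty list for each column to be aggregated
            users[key] = {c: row[c] for c in id_columns}
            users[key].update({c: [] for c in list_columns})
        for c in list_columns:
            if row[c] not in users[key][c]:  # skip repeats, keep order
                users[key][c].append(row[c])
    return list(users.values())


# Made-up sample with two users
csv_text = """\
user_id,name,aka
1,Rick Sanchez,Rick
1,Rick Sanchez,Grandpa
1,Rick Sanchez,Rick
2,Morty Smith,Morty
"""

rows = list(DictReader(StringIO(csv_text)))
users = group_users(rows, id_columns=["user_id", "name"], list_columns=["aka"])
print(users)
```

Unlike the set-based deduplication above, this keeps values in the order they first appear in the file.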