Importing a file with individual JSON lines to a list

Question

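I need some help.

I have a JSON file that is the result of an Auth0 export data dump. Each record sits on its own line rather than being comma-separated.

Below is the file, named OUTPUT_USER_DUMP.json: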

{"user_id": "auth0|5f9886ee8e36ac0069e8fc3a","name": "John Smith","email": "jsmith@company.com"}
{"user_id": "auth0|5fa43f699e937f0068c40d8e","name": "Bob Anderson","email": "banderson@company.com"}

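What I would like to do is use a Python script to open this JSON dump file and load its contents into a list variable (an example of the list variable when printed):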

[{"user_id": "auth0|5f9886ee8e36ac0069e8fc3a","name": "John Smith","email": "jsmith@company.com"},
{"user_id": "auth0|5fa43f699e937f0068c40d8e","name": "Bob Anderson","email": "banderson@company.com"}]

Any help?

Answer 1

Given:

bad_json='''
{
    "user_id": "auth0|5f9886ee8e36ac0069e8fc3a",
    "name": "John Smith",
    "email": "jsmith@company.com"
}
{
    "user_id": "auth0|5fa43f699e937f0068c40d8e",
    "name": "Bob Anderson",
    "email": "banderson@company.com"
}'''

You can use a regular expression to insert the missing commas and wrap the result in brackets:

import re 
import json 

t=re.sub(r"\}\s*\{", "},\n{", bad_json)
new_json=rf'[{t}]'

>>> json.loads(new_json)
[{'user_id': 'auth0|5f9886ee8e36ac0069e8fc3a', 'name': 'John Smith', 'email': 'jsmith@company.com'}, {'user_id': 'auth0|5fa43f699e937f0068c40d8e', 'name': 'Bob Anderson', 'email': 'banderson@company.com'}]
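One caveat with the substitution above: it rewrites every `}`, whitespace, `{` sequence, including one that happens to sit inside a string value. A minimal sketch of an alternative, assuming the objects are separated only by whitespace, is to let the standard library's `json.JSONDecoder.raw_decode` parse one value at a time and report where it stopped. The helper name `parse_concatenated_json` and the sample data are made up for illustration.

```python
import json


def parse_concatenated_json(text):
    """Parse JSON values that follow one another, separated only by whitespace."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos == length:
            break
        # raw_decode returns the decoded value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


# Hypothetical sample: the first value contains "} {" inside a string.
sample = '''
{"name": "a } { b"}
{"name": "second"}
'''
print(parse_concatenated_json(sample))
```

Because the decoder tracks string boundaries itself, braces inside values are left alone.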

Edit

Your file appears to be in JSON Lines format, with one JSON object per line.

Given:

cat file
{"user_id": "auth0|5f9886ee8e36ac0069e8fc3a","name": "John Smith","email": "jsmith@company.com"}
{"user_id": "auth0|5fa43f699e937f0068c40d8e","name": "Bob Anderson","email": "banderson@company.com"}

You can iterate over the file line by line and decode as you go:

import json 

with open('/tmp/file') as f:
    data=[json.loads(line) for line in f]

>>> data
[{'user_id': 'auth0|5f9886ee8e36ac0069e8fc3a', 'name': 'John Smith', 'email': 'jsmith@company.com'}, {'user_id': 'auth0|5fa43f699e937f0068c40d8e', 'name': 'Bob Anderson', 'email': 'banderson@company.com'}]
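The list comprehension above raises `json.JSONDecodeError` on a blank line and does not say which line was at fault. A hedged sketch of a slightly more defensive variant follows; the function name `load_json_lines` and the choice to raise `ValueError` are assumptions for illustration, not part of the original answer.

```python
import json


def load_json_lines(lines):
    """Decode one JSON value per non-blank line; report the line number on failure."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. an empty line at the end
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
    return records


# Any iterable of strings works; an open file object would be passed the same way.
print(load_json_lines(['{"a": 1}\n', '\n', '{"b": 2}\n']))
```

With a real file this would be called as `load_json_lines(f)` inside the `with open(...) as f:` block.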

Answer 2

You can read a newline-delimited JSON file directly with pandas. You can then use the DataFrame's to_dict function to convert it into the format you asked for.

Code

import pandas as pd

df = pd.read_json('./OUTPUT_USER_DUMP.json', lines=True)
print(df.to_dict('records'))

Output

[
  {'user_id': 'auth0|5f9886ee8e36ac0069e8fc3a', 'name': 'John Smith', 'email': 'jsmith@company.com'}, 
  {'user_id': 'auth0|5fa43f699e937f0068c40d8e', 'name': 'Bob Anderson', 'email': 'banderson@company.com'}
]

Answer 3

You can read the file line by line and load each line as JSON data:

from json import loads

with open("OUTPUT_USER_DUMP.json", "r") as f2r:
    data = [loads(each_line) for each_line in f2r]
    print(data)

Answer 4

Since file objects in Python are iterable and yield their lines, you can write a function that processes each line as a JSON object:

from json import dumps, loads
from typing import Iterable, Iterator


FILENAME = 'OUTPUT_USER_DUMP.json'


def read_json_objects(file: Iterable[str]) -> Iterator[dict]:
    """Yield JSON objects from each line of a given file."""

    for line in file:
        if line := line.strip():
            yield loads(line)


def main():
    """Run the script."""

    with open(FILENAME, 'r', encoding='utf-8') as file:
        data = list(read_json_objects(file))

    print(dumps(data, indent=2))


if __name__ == '__main__':
    main()
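Since the function accepts any iterable of strings and yields objects lazily, it can also feed a comprehension directly. The following is a small sketch of that usage, repeating the generator so the block runs on its own; the `StringIO` stand-in and the `example.com` addresses are made up for illustration.

```python
from io import StringIO
from json import loads
from typing import Iterable, Iterator


def read_json_objects(file: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON object per non-blank line (same logic as the answer)."""
    for line in file:
        if line := line.strip():
            yield loads(line)


# StringIO stands in for the opened dump file; the data is hypothetical.
fake_file = StringIO(
    '{"user_id": "auth0|1", "email": "a@example.com"}\n'
    '\n'
    '{"user_id": "auth0|2", "email": "b@example.com"}\n'
)

# Pull out a single field without first building the full list of records.
emails = [user["email"] for user in read_json_objects(fake_file)]
print(emails)
```

Note that the `:=` operator requires Python 3.8 or later.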


python

json

auth0