merge & write two jsonl (json lines) files into a new jsonl file in python3.6

Hello, I have two jsonl files like so:

one.jsonl

{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}

second.jsonl

{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}

My goal is to write a new jsonl file (with the encoding preserved) named merged_file.jsonl that looks like this:

{"name": "one", "description": "testDescription...", "comment": "1"}
{"name": "two", "description": "testDescription2...", "comment": "2"}
{"name": "eleven", "description": "testDescription11...", "comment": "11"}
{"name": "twelve", "description": "testDescription12...", "comment": "12"}
{"name": "thirteen", "description": "testDescription13...", "comment": "13"}

My approach is like this:

import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
        try:
            result.append(extract_json(infile)) #tried json.loads(infile) too
        except ValueError:
            print(f)

#write the file in BOM TO preserve the emojis and special characters
with open('merged_file.jsonl','w', encoding= 'utf-8-sig') as outfile:
    json.dump(result, outfile)

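But I am running into this error: TypeError: Object of type generator is not JSON serializable. I would appreciate any hint/help. Thank you! I have looked at other SO posts; they all write normal json files, and the same approach should work in my case too, but it keeps failing.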

Reading a single file like this works:

data_json = io.open('one.jsonl', mode='r', encoding='utf-8-sig') # Opens in the JSONL file
data_python = extract_json(data_json)
for line in data_python:
    print(line)

####outputs####
#{'name': 'one', 'description': 'testDescription...', 'comment': '1'}
#{'name': 'two', 'description': 'testDescription2...', 'comment': '2'}

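It is likely that extract_json returns a generator rather than a list/dict. A generator is not JSON serializable, which is what the TypeError is telling you.
Since the input is jsonl, every line is a valid JSON document on its own,
so you only need to tweak your existing code slightly.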

import json
import glob

result = []
for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    with open(f, 'r', encoding='utf-8-sig') as infile:
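        for line in infile:
            if not line.strip():
                continue  # skip blank lines, which are not valid JSON
            try:
                result.append(json.loads(line))  # parse each line as its own JSON document
            except ValueError:
                print(f)  # report the file that contains a malformed line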

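# This would output jsonl
with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    # json.dump(result, outfile) would write one big JSON array instead of jsonl
    # write each object as one JSON line; ensure_ascii=False keeps emojis and
    # other non-ASCII characters as-is instead of \uXXXX escapes
    for obj in result:
        outfile.write(json.dumps(obj, ensure_ascii=False) + "\n")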
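The TypeError from the question can be reproduced without any files. The sketch below assumes extract_json is a generator function, which the question never shows; fake_extract_json is a hypothetical stand-in written only for illustration.

```python
import json


def fake_extract_json(lines):
    # Hypothetical stand-in for extract_json: yields one parsed object per line.
    for line in lines:
        yield json.loads(line)


lines = ['{"name": "one"}', '{"name": "two"}']

# append() stores the generator object itself, not the dicts it would yield.
result = []
result.append(fake_extract_json(lines))
try:
    json.dumps(result)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# extend() consumes the generator, so the list holds plain dicts.
fixed = []
fixed.extend(fake_extract_json(lines))
merged_text = "\n".join(json.dumps(obj) for obj in fixed)

print(error_message)
print(merged_text)
```

If extract_json really is a generator, replacing result.append(...) with result.extend(...) in the original loop would likely be enough to get rid of the error, although json.dump(result, outfile) would then still write a single JSON array rather than one object per line.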

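Now that I think of it, you don't even have to load the lines with json; the only thing parsing buys you is the chance to filter out malformed JSON lines.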

You can copy all the lines across in one pass like this:

import glob

with open('merged_file.jsonl', 'w', encoding='utf-8-sig') as outfile:
    for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
        with open(f, 'r', encoding='utf-8-sig') as infile:
            for line in infile:
                # a file's last line may lack a newline; add one so records do not run together
                outfile.write(line if line.endswith("\n") else line + "\n")
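The two approaches (parse every line, or copy lines through untouched) can be folded into one helper. This is only a sketch: merge_jsonl, its validate flag, and the returned (written, skipped) pair are names made up here, not part of any library.

```python
import glob
import json
import os


def merge_jsonl(pattern, out_path, validate=True):
    """Merge every file matching pattern into out_path, one JSON document per line.

    Returns (number of lines written, list of (path, line number) skipped).
    """
    written = 0
    skipped = []
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding="utf-8-sig") as outfile:
        for path in sorted(glob.glob(pattern)):
            if os.path.abspath(path) == out_abs:
                continue  # never read the output file back into itself
            with open(path, "r", encoding="utf-8-sig") as infile:
                for lineno, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines
                    if validate:
                        try:
                            json.loads(line)
                        except ValueError:
                            skipped.append((path, lineno))
                            continue
                    # the original text is written, so characters are preserved as-is
                    outfile.write(line + "\n")
                    written += 1
    return written, skipped
```

A call such as merge_jsonl("folder_with_all_jsonl/*.jsonl", "merged_file.jsonl") would stream the files one line at a time, so nothing has to fit in memory at once; passing validate=False skips the parsing step entirely.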

Another super simple way, if you don't care about json validation:

cat folder_with_all_jsonl/*.jsonl > merged_file.jsonl
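Because cat copies bytes as-is, a BOM at the start of each input file stays in the middle of the output, and a file that does not end with a newline runs into the first line of the next one. A quick way to check the merged file afterwards is sketched below; find_invalid_lines is a hypothetical helper name, not an existing function.

```python
import json


def find_invalid_lines(path, encoding="utf-8-sig"):
    """Return the 1-based numbers of non-blank lines that are not valid JSON."""
    bad = []
    with open(path, "r", encoding=encoding) as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are ignored rather than reported
            try:
                json.loads(line)
            except ValueError:
                # covers a stray BOM inside the file and two records fused on one line
                bad.append(lineno)
    return bad
```

An empty list from find_invalid_lines("merged_file.jsonl") suggests the concatenation was clean; any reported line numbers point at records worth inspecting by hand.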