Merge multiple JSONL files from a folder using Python
I am looking for a way to merge multiple JSONL files from a folder using a Python script, something like the script below, which works for JSON files.
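import json
import glob

result = []
for f in glob.glob("*.json"):
    # Each input file is assumed to hold a single JSON document.
    with open(f) as infile:
        result.append(json.load(infile))

# json.dump writes str, so the output file must be opened in text mode.
with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)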
Below is a sample of one of my JSONL files (a single line only):
{"date":"2021-01-02T08:40:11.378000000Z","partitionId":"0","sequenceNumber":"4636458","offset":"1327163410568","iotHubDate":"2021-01-02T08:40:11.258000000Z","iotDeviceId":"text","iotMsg":{"header":{"deviceTokenJwt":"text","msgType":"text","msgOffset":3848,"msgKey":"text","msgCreation":"2021-01-02T09:40:03.961+01:00","appName":"text","appVersion":"text","customerType":"text","customerGroup":"Customer"},"msgData":{"serialNumber":"text","machineComponentTypeId":"text","applicationVersion":"3.1.4","bootloaderVersion":"text","firstConnectionDate":"2018-02-20T10:34:47+01:00","lastConnectionDate":"2020-12-31T12:05:04.113+01:00","counters":[{"type":"DurationCounter","id":"text","value":"text"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":2423},{"type":"IntegerCounter","id":"text","value":9914},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":976},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"IntegerCounter","id":"text","value":28},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":1}],"defects":[{"description":"ProtocolDb.ProtocolIdNotFound","defectLevelId":"Warning","occurrence":3},{"description":"BridgeBus.CrcError","defectLevelId":"Warning","occurrence":1},{"description":"BridgeBus.Disconnected","defectLevelId":"Warning","occurrence":6}],"maintenanceEvents":[{"interventionId":"Other","comment":"text","appearance_display":0,"intervention_date":"2018-11-29T09:52:16.726+01:00","intervention_counterValue":"text","intervention_workerName":"text"},{"interventionId":"Other","comment":"text","appearance_display":0,"intervention_date":"2019-06-04T15:30:15.954+02:00","intervention_counterValue":"text","intervention_workerName":"text"}]}}}
Does anyone know how I can load and merge files like this?
You can update a main dict with every JSON object you load, like this:
import json
import glob

result = {}
for f in glob.glob("*.json"):
    with open(f) as infile:
        for line in infile:  # one JSON object per line
            if line.strip():  # ignore blank lines
                result.update(json.loads(line))  # merge the dicts

with open("merged_file.json", "w") as outfile:  # text mode for json.dump
    json.dump(result, outfile)
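Note that this overwrites any keys the objects have in common: only the value from the last object loaded is kept.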
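To make the overwriting caveat concrete, here is a small sketch using two made-up records (the field values are hypothetical, loosely modeled on the sample line in the question). It contrasts `dict.update`, which keeps only the last value for a shared key, with appending to a list, which keeps every record.

```python
import json

# Two hypothetical one-line JSONL records that share the same top-level keys.
lines = [
    '{"partitionId": "0", "sequenceNumber": "1"}',
    '{"partitionId": "0", "sequenceNumber": "2"}',
]

merged_dict = {}
merged_list = []
for line in lines:
    record = json.loads(line)
    merged_dict.update(record)  # values from later records replace earlier ones
    merged_list.append(record)  # every record is preserved

print(merged_dict)
print(len(merged_list))
```

If every record has to survive the merge, as is likely for log-style data like the sample above, collecting into a list (or writing line by line) is probably the safer choice.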
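Since each line in a JSONL file is a complete JSON object, you don't actually need to parse the JSONL files at all to merge them into another JSONL file. Instead, merge them by simply concatenating them. The caveat is that the JSONL format does not mandate a newline at the end of the file. You therefore have to check whether the last line of each file ends with a newline; if it does not, explicitly write one so that the last record of that file is separated from the first record of the next file: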
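import glob

with open("merged_file.json", "w") as outfile:
    for filename in glob.glob("*.json"):
        if filename == "merged_file.json":
            continue  # don't read the output file back into itself
        with open(filename) as infile:
            line = "\n"  # default so an empty input file adds nothing
            for line in infile:
                outfile.write(line)
            # Terminate the last record if the file lacked a trailing newline.
            if not line.endswith("\n"):
                outfile.write("\n")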
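If you would rather validate the records while merging, a variant along the following lines might help. This is only a sketch: the function name `merge_jsonl`, its parameters, and the default `*.jsonl` pattern are assumptions made for illustration, not part of the answer above. It parses each non-blank line with `json.loads` purely as a check, then writes the line back out with exactly one trailing newline.

```python
import json
from pathlib import Path


def merge_jsonl(folder, output_path, pattern="*.jsonl"):
    """Concatenate JSONL files in `folder` into `output_path`.

    Hypothetical helper: returns the number of records written and
    raises json.JSONDecodeError if any non-blank line is not valid JSON.
    """
    folder = Path(folder)
    output_path = Path(output_path)
    count = 0
    with open(output_path, "w", encoding="utf-8") as outfile:
        # sorted() gives a deterministic file order.
        for path in sorted(folder.glob(pattern)):
            if path.resolve() == output_path.resolve():
                continue  # never read the output file back into itself
            with open(path, encoding="utf-8") as infile:
                for line in infile:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    json.loads(line)  # validation only; the result is discarded
                    outfile.write(line + "\n")
                    count += 1
    return count
```

For the files in the question, which use a `.json` extension, a call such as `merge_jsonl(".", "merged_file.json", pattern="*.json")` should work. The trade-off is speed: parsing every line is slower than plain concatenation, so it is only worth it when the inputs might be malformed.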