AWS SageMaker output: how to read a file with multiple JSON objects spread over multiple lines
I have a bunch of JSON files that look like this:
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142, ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774, ], "word": "blah blah blah"}
I can read these with:
import json

f = open(file_name)
data = []
for line in f:
    # each line is a complete JSON object, so parse it directly
    data.append(json.loads(line))
But I have another file where the output looks like this:
{
"predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
]
}
{
"predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
]
}
{
"predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
]
}
That is, the JSON is formatted across multiple lines, so I can't simply read it line by line. Is there a simple way to parse this? Or do I have to write something that stitches each JSON object back together line by line and then calls json.loads?
Well, as far as I know, there's unfortunately no way to load JSONL-formatted data with json.loads directly. One option, though, is to write a helper function that converts it into a valid JSON string, like this:
import json
string = """
{
"predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
]
}
{
"predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
]
}
{
"predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
]
}
"""
def json_lines_to_json(s: str) -> str:
    # replace the first occurrence of '{' with '[{'
    s = s.replace('{', '[{', 1)
    # replace the last occurrence of '}' with '}]'
    s = s.rsplit('}', 1)[0] + '}]'
    # now replace every '}' immediately followed by a newline with '},'
    s = s.replace('}\n', '},\n')
    return s
print(json.loads(json_lines_to_json(string)))
This prints:
[{'predictions': [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]]}, {'predictions': [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]}, {'predictions': [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]]}]
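To read an actual output file, the same json_lines_to_json helper defined above can be applied to the whole file contents at once. A minimal sketch, assuming the file follows the same layout as the sample above (the file name here is only a placeholder):

import json

file_name = 'batch-output/predictions.out'  # placeholder path

with open(file_name) as f:
    contents = f.read()

# turn the concatenated multi-line objects into one JSON array, then parse it
records = json.loads(json_lines_to_json(contents))

for record in records:
    print(record['predictions'])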
Note: your first example doesn't actually appear to be valid JSON (or at least not valid JSON Lines, as far as I can tell). In particular, this part seems invalid because of the trailing comma after the last array element:
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], ...}
To make sure it's valid after running it through the helper function, you would also need to strip the trailing comma, so that each line ends up formatted like this:
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], ...},
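One way to drop those trailing commas before parsing is a small regular-expression pass; a minimal sketch (not part of the original code), assuming no string values contain a comma immediately followed by a closing bracket:

import json
import re

line = '{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], "word": "blah blah blah"}'

# remove a comma that sits just before a closing ']' or '}'
cleaned = re.sub(r',\s*([\]}])', r'\1', line)

print(json.loads(cleaned))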
There also seems to be another suggestion to split on the newline character and call json.loads on each line; in practice, calling json.loads separately on each object should be (slightly) slower than calling it once on the whole list, as shown below:
from timeit import timeit
import json
string = """\
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142 ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774 ], "word": "blah blah blah"}\
"""
def json_lines_to_json(s: str) -> str:
    # strip newlines from the end, then replace every '}' followed by a
    # newline with '},' followed by a newline
    s = s.rstrip('\n').replace('}\n', '},\n')
    # return the string wrapped in brackets (a JSON list)
    return f'[{s}]'
n = 10_000
print('string replace: ', timeit(r'json.loads(json_lines_to_json(string))', number=n, globals=globals()))
print('json.loads each line: ', timeit(r'[json.loads(line) for line in string.split("\n")]', number=n, globals=globals()))
Results:
string replace: 0.07599360000000001
json.loads each line: 0.1078384
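For completeness: if the string manipulation feels brittle, the standard library's json.JSONDecoder.raw_decode can also walk through concatenated JSON objects no matter how they are split across lines. This is an alternative to the approach above, sketched here under the assumption that the input is just JSON objects separated by whitespace:

import json

def iter_json_objects(s: str):
    # repeatedly decode one object and continue from where the previous one ended
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        # skip whitespace between objects
        while pos < len(s) and s[pos].isspace():
            pos += 1
        if pos >= len(s):
            break
        obj, pos = decoder.raw_decode(s, pos)
        yield obj

# works for both the single-line and the multi-line samples shown above
print(list(iter_json_objects(string)))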