Split a nested JSON file into two JSON files according to their ID?
I have a nested JSON file that is loaded as a Python dictionary named movies_data, as follows:
import json

with open('project_folder/data_movie_absa.json') as infile:
    movies_data = json.load(infile)
It has the following structure:
{ "review_1": {"tokens": ["Best", "show", "ever", "!"],
"movie_user_4": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]},
"movie_user_6": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}},
"review_2": {"tokens": ["Its", "a", "great", "show"],
"movie_user_1": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]},
"movie_user_6": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}},
"review_3": {"tokens": ["I", "love", "this", "actor", "!"],
"movie_user_17": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]},
"movie_user_23": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}},
"review_4": {"tokens": ["Bad", "movie"],
"movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]},
"movie_user_6": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}
...
}
It has 3324 key-value pairs (i.e. keys up to review_3224). I want to split this file into two JSON files (train_movies.json and test_movies.json) according to a specific list of keys:
test_IDS = ['review_2', 'review_4']
with open("train_movies.json", "w", encoding="utf-8-sig") as outfile_train, open("test_movies.json", "w", encoding="utf-8-sig") as outfile_test:
    for review_id, review in movies_data.items():
        if review_id in test_IDS:
            outfile = outfile_test
            outfile.write('{"%s": "%s"}' % (review_id, movies_data[review_id]))
        else:
            outfile = outfile_train
            outfile.write('{"%s": "%s"}' % (review_id, movies_data[review_id]))
outfile.close()
For test_movies.json I get the following structure:
{"review_2": "{'tokens': ['Its', 'a', 'great', 'show'],
'movie_user_4': {'aspects': ['O', 'O', 'O', 'B_A'], 'sentiments': ['O', 'O', 'B_S', 'O']},
'movie_user_6': {'aspects': ['O', 'O', 'O', 'B_A'], 'sentiments': ['O', 'O', 'B_S', 'O']}}"}
{"review_4": "{'tokens': ['Bad', 'movie'],
'movie_user_1': {'aspects': ['O', 'B_A'], 'sentiments': ['B_S', 'O']},
'movie_user_6': {'aspects': ['O', 'B_A'], 'sentiments': ['B_S', 'O']}}"}
Unfortunately, this structure has several problems, such as inconsistent quotes (" vs. '), no commas between reviews, and so on. As a result, reading test_movies.json as a JSON file fails:
with open('project_folder/test_movies.json') as infile:
    testing_data = json.load(infile)
Error message:
JSONDecodeError Traceback (most recent call last)
<ipython-input-10-3548a718f421> in <module>()
1 with open('/content/gdrive/My Drive/project_folder/test_movies.json') as infile:
----> 2 testing_data = json.load(infile)
1 frames
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
342 if s.startswith('\ufeff'):
343 raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
--> 344 s, 0)
345 else:
346 if not isinstance(s, (bytes, bytearray)):
JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
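The desired output should have a proper JSON structure, like the original movies_data, so that Python can read it correctly as a dictionary.
Can you please help me correct my Python code?
Thanks in advance!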
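Issues
- Use json.dumps (or json.dump, as in the code below) to create the output that is written to the file.
- Building the output with Python string formatting, i.e. '{"%s": "%s"}' % (review_id, movies_data[review_id]), produces the problems you described.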
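To make the two points above concrete, here is a minimal sketch comparing %-formatting with json.dumps on a single made-up record. The one-entry `review` dictionary is a shortened stand-in for the real data, not taken from the file.

```python
import json

# Shortened, illustrative record (an assumption, not the full review_4 entry).
review = {"tokens": ["Bad", "movie"]}

# %-formatting embeds the dict's Python repr (single quotes) as one JSON string.
formatted = '{"%s": "%s"}' % ("review_4", review)

# json.dumps serializes the whole mapping, so the value stays a nested object.
dumped = json.dumps({"review_4": review})

print(formatted)
print(dumped)
print(type(json.loads(formatted)["review_4"]).__name__)  # the value is a str, not a dict

# Writing one such record after another yields several top-level objects,
# which json.loads rejects as a single document.
try:
    json.loads(formatted + formatted)
    concatenated_ok = True
except json.JSONDecodeError:
    concatenated_ok = False
print(concatenated_ok)
```

Even when a single formatted record happens to parse, the review comes back as a string rather than a nested dictionary, so serializing the whole dictionary at once is the safer route.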
Code
import json

train, test = {}, {}  # Dictionaries for storing training and test data
for review_id, review in movies_data.items():
    if review_id in test_IDS:
        test[review_id] = review
    else:
        train[review_id] = review

# Write the test split
with open("test_movies.json", "w") as outfile_test:
    json.dump(test, outfile_test)

# Write the training split
with open("train_movies.json", "w") as outfile_train:
    json.dump(train, outfile_train)
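The traceback in the question complains about a UTF-8 BOM, which most likely comes from writing the files with encoding="utf-8-sig". The code above does not pass that encoding, which sidesteps the error. The following sketch reproduces the effect under assumed conditions: the temporary file name `bom_demo.json` and the one-entry `data` dictionary are illustrative only.

```python
import json
import os
import tempfile

# Illustrative data and path (assumptions, not from the original post).
data = {"review_4": {"tokens": ["Bad", "movie"]}}
path = os.path.join(tempfile.mkdtemp(), "bom_demo.json")

# Writing with utf-8-sig puts a byte order mark at the start of the file.
with open(path, "w", encoding="utf-8-sig") as f:
    json.dump(data, f)

# Reading as plain utf-8 keeps the BOM as a character, which json.load rejects.
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
    bom_error = None
except json.JSONDecodeError as exc:
    bom_error = str(exc)
print(bom_error)

# Reading with utf-8-sig strips the BOM, so the file parses again.
with open(path, encoding="utf-8-sig") as f:
    reloaded = json.load(f)
print(reloaded == data)
```

If files written with a BOM already exist, opening them with encoding="utf-8-sig" as in the last step should let json.load read them, provided their content is otherwise valid JSON.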
Results
Input: contents of test.json
{ "review_1": {"tokens": ["Best", "show", "ever", "!"],
"movie_user_4": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]},
"movie_user_6": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}},
"review_2": {"tokens": ["Its", "a", "great", "show"],
"movie_user_1": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]},
"movie_user_6": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}},
"review_3": {"tokens": ["I", "love", "this", "actor", "!"],
"movie_user_17": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]},
"movie_user_23": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}},
"review_4": {"tokens": ["Bad", "movie"],
"movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]},
"movie_user_6": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}
}
Output: contents of test_movies.json
{"review_2": {"tokens": ["Its", "a", "great", "show"], "movie_user_1": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}, "movie_user_6": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}}, "review_4": {"tokens": ["Bad", "movie"], "movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}, "movie_user_6": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}}
Output: contents of train_movies.json
{"review_1": {"tokens": ["Best", "show", "ever", "!"], "movie_user_4": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}, "movie_user_6": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}}, "review_3": {"tokens": ["I", "love", "this", "actor", "!"], "movie_user_17": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}, "movie_user_23": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}}}
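As a possible refinement for the full file with a few thousand reviews, the split can be wrapped in a small function that uses a set for the ID lookup, followed by a round-trip check that the written file loads back as a dictionary. This is only a sketch: `split_by_ids` is a hypothetical helper name, the sample data is shortened, and the output goes to a temporary directory rather than the project folder.

```python
import json
import os
import tempfile


def split_by_ids(data, test_ids):
    """Return (train, test) dictionaries split by membership in test_ids."""
    test_ids = set(test_ids)  # set membership avoids scanning a list per key
    train = {k: v for k, v in data.items() if k not in test_ids}
    test = {k: v for k, v in data.items() if k in test_ids}
    return train, test


# Shortened sample in the same shape as movies_data (illustrative only).
movies_data = {
    "review_1": {"tokens": ["Best", "show", "ever", "!"]},
    "review_2": {"tokens": ["Its", "a", "great", "show"]},
    "review_3": {"tokens": ["I", "love", "this", "actor", "!"]},
    "review_4": {"tokens": ["Bad", "movie"]},
}

train, test = split_by_ids(movies_data, ["review_2", "review_4"])

# Write the test split and read it back to confirm it is valid JSON.
test_path = os.path.join(tempfile.mkdtemp(), "test_movies.json")
with open(test_path, "w", encoding="utf-8") as f:
    json.dump(test, f, indent=2)
with open(test_path, encoding="utf-8") as f:
    testing_data = json.load(f)

print(sorted(train))
print(testing_data == test)
```

Passing indent=2 to json.dump is optional; it only makes the output file easier to read than the single-line form shown above.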