Strict consumption of JSON: how to reorder key:value pairs to a specific JSON schema for Open Refine

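I am trying to analyze a messy dataset of JSON strings (40k rows) with Open Refine, but because JSON objects are unordered, the objects on some lines were scrambled when they were returned and written to file.

Some objects are missing keys, and some have their keys in the wrong order. Example: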

1   {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2   {"id":"22","about":"barFoo", "category":"NotABar"}
3   {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}

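Problem:

When importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file. It then reads the supplied file, compares the JSON object on each line against the schema, and imports or discards it depending on how closely it matches. As a result, many entries are left out!

Ideal:

Using Python, I would like to reorder the JSON objects to a specific schema that I specify.

Example:

Specified schema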

{"about":"", "category":"", "id":"", "cat_list": ""}

Then rearrange every line of JSON, with its keys and values, into that specific format:

1   {"about": ....
2   {"about": ....
3   {"about": ....
....
....
....
40,000 {"about": ....

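I am not entirely sure how to do this efficiently.

Edit:

I decided to just write a script to organize this. I removed some of the complicated fields and ended up with a complete .JSON file: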

{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater
"}, 
{"name":"Febreze", 
"category":"Product/Service
", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS
", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie
", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},

Still no luck. Google Refine still refuses to accept my file. What am I doing wrong?

"Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file."

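This sounds like it is accidentally being detected as XML rather than JSON, or even as line-based text.

However, you can choose which importer to use (for example line-based or JSON) instead of relying on the one OpenRefine selects automatically, which is a guess that is sometimes wrong.

It looks to me like you may be dealing with the new and upcoming "JSON Lines" or "newline-delimited JSON" format, such as the one documented here: http://jsonlines.org/

We have an open issue to eventually add JSON Lines support to OpenRefine: https://github.com/OpenRefine/OpenRefine/issues/1135

In the meantime, take a look at the "On the Web" section of the jsonlines.org site for tools that may help with your needs.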
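Until such support exists, one option is to convert the line-delimited file into a single JSON array yourself. The following is only a minimal sketch, assuming one JSON object per line and the four-key schema from the question; the names SCHEMA, normalize and jsonl_to_array are made up for illustration. It puts the keys in a fixed order, fills missing keys with an empty string, drops keys that are not in the schema, and sets aside lines that do not parse instead of silently losing them.

```python
import json

# Target key order, taken from the schema given in the question.
SCHEMA = ["about", "category", "id", "cat_list"]


def normalize(record, schema=SCHEMA, default=""):
    """Return a dict holding exactly the schema keys, in schema order.

    Missing keys get `default`; keys not listed in the schema are dropped.
    """
    return {key: record.get(key, default) for key in schema}


def jsonl_to_array(lines, schema=SCHEMA):
    """Parse one JSON object per line.

    Returns (records, rejected), where rejected holds
    (line_number, reason) pairs for lines that could not be used.
    """
    records, rejected = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((number, str(exc)))
            continue
        if not isinstance(obj, dict):
            rejected.append((number, "not a JSON object"))
            continue
        records.append(normalize(obj, schema))
    return records, rejected


if __name__ == "__main__":
    sample = [
        '{"id":"22","about":"barFoo", "category":"NotABar"}',
        # Invalid JSON (key:value pair inside an array), so it is rejected.
        '{"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}',
    ]
    records, rejected = jsonl_to_array(sample)
    print(json.dumps(records, indent=2))
    print(rejected)
```

For a real file, pass the open file object as `lines` and write the result with `json.dump(records, out_file)`. The dict comprehension relies on dicts preserving insertion order, which Python guarantees from version 3.7.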

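Not sure whether you have resolved this yet.

The JSON must be valid to import successfully - at the moment, the text you posted in the question above does not validate with tools such as http://jsonlint.com.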
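The same check can be done locally with Python's standard json module, which may be more practical for a large file. This is just a rough sketch; first_json_error is a hypothetical helper name, and it only reports the first problem the parser runs into.

```python
import json


def first_json_error(text):
    """Return None if text is valid JSON, else a description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


if __name__ == "__main__":
    # A raw line break inside a string value and several top-level
    # objects without an enclosing array, as in the posted file.
    broken = '{"city":"Bridgewater\n"}, \n{"name":"Febreze"}'
    print(first_json_error(broken))
    print(first_json_error('[{"city":"Bridgewater"}]'))
```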

The problem you ran into when importing this into OpenRefine (aka Google Refine) is that the JSON objects must be inside an array:

[{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater"},
{"name":"Febreze", 
"category":"Product/Service", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}]

I can successfully import the JSON posted here into OpenRefine, and it works fine.
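If the file is too large to fix by hand, the two repairs made above (wrapping the objects in an array and removing the stray line breaks inside values) could be scripted. The following is a hedged sketch, not a general JSON repair tool: repair_dump is a hypothetical name, and it assumes the file is a comma-separated run of objects whose only defects are those two. It relies on the strict=False option of json.loads, which allows control characters such as raw newlines inside strings.

```python
import json


def repair_dump(text):
    """Wrap a comma-separated run of JSON objects in an array and tidy values.

    Assumes the only defects are a missing enclosing array, a trailing
    comma, and raw line breaks inside string values.
    """
    body = text.strip().rstrip(",").strip()
    if not body.startswith("["):
        body = "[" + body + "]"
    # strict=False lets the parser accept raw newlines inside strings.
    records = json.loads(body, strict=False)
    # Strip leading and trailing whitespace, including the stray
    # line breaks, from string values.
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in record.items()}
        for record in records
    ]


if __name__ == "__main__":
    raw = '{"name":"Febreze", \n"category":"Product/Service\n", \n"city":"Added Nothing"},\n'
    print(json.dumps(repair_dump(raw), indent=1))
```

Writing the result back out with json.dump should give a file that the JSON importer accepts, provided nothing else in the data is malformed.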