Strict consumption of JSON: how to reorder key:values to a specific JSON schema for Open Refine
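I'm trying to analyse a messy dataset (40k rows) of JSON strings using Open Refine, but because JSON objects are unordered, some rows' JSON objects got mixed up when they were returned and written to file.
Some objects are missing keys, and some have their keys in a different order. Example: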
1 {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2 {"id":"22","about":"barFoo", "category":"NotABar"}
3 {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}
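Problem:
When importing the data into Open Refine, the program asks for a specific schema to compare against as it reads the file. It then reads the supplied file, compares the JSON object on each row to the schema, and imports or discards it depending on how well it matches. As a result, many entries are left out!
Ideal:
Using Python, I'd like to reorder the JSON objects into a specific schema that I specify.
Example:
Specified schema: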
{"about":"", "category":"", "id":"", "cat_list": ""}
Then rearrange each row of JSON and its key:values into that specific format:
1 {"about": ....
2 {"about": ....
3 {"about": ....
....
....
....
40,000 {"about": ....
I'm not entirely sure how to do this efficiently.
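For illustration, here is a minimal sketch of the kind of reordering described above. It assumes each row is a single JSON object with no leading row number, that missing keys should be filled with an empty string, and that Python 3.7+ is used so dicts keep insertion order. The names `SCHEMA_KEYS`, `conform` and `conform_line` are made up for this example.

```python
import json

# Hypothetical key order, copied from the specified schema in the example above.
SCHEMA_KEYS = ["about", "category", "id", "cat_list"]

def conform(record, keys=SCHEMA_KEYS, default=""):
    """Return a new dict holding exactly `keys`, in that order.

    Keys missing from `record` get `default`; keys not in the schema are dropped.
    """
    return {key: record.get(key, default) for key in keys}

def conform_line(line, keys=SCHEMA_KEYS, default=""):
    """Parse one JSON object from a line of text and re-serialize it in schema order."""
    return json.dumps(conform(json.loads(line), keys, default))

# Row 2 from the example: keys out of order and "cat_list" missing.
sample = '{"id":"22","about":"barFoo", "category":"NotABar"}'
print(conform_line(sample))
# {"about": "barFoo", "category": "NotABar", "id": "22", "cat_list": ""}
```

Applied line by line, this is a single pass over the 40k rows, so efficiency should not be a concern at that size. Note that it will raise on any row that is not valid JSON.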
Edit:
I decided to just write a script to organize this. I removed some of the complex fields and ended up with a complete .JSON file:
{"name":"Carstar Bridgewater",
"category":"Automotive",
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.",
"country":"Canada",
"state":"NS",
"city":"Bridgewater
"},
{"name":"Febreze",
"category":"Product/Service
",
"about":"Freshness that eliminates odorsso you can breathe happy.",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings",
"category":"Professional Services",
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings",
"country":"Canada",
"state":"NS
",
"city":"Middle Sackville"},
{"name":"The Hunger Games",
"category":"Movie
",
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
Still no luck. Google Refine still refuses to accept my file. What am I doing wrong?
"Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file."
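This sounds like the file was accidentally detected as XML rather than JSON, or even line-based text.
However, you can choose which importer to use (for example, line-based or JSON) rather than relying on the one OpenRefine picks automatically, which is a guess and is sometimes wrong.
It looks to me like you may be dealing with the newer "JSON Lines" or "newline-delimited JSON" format, as documented here: http://jsonlines.org/
We have an open issue to eventually add JSON Lines support to OpenRefine: https://github.com/OpenRefine/OpenRefine/issues/1135
In the meantime, see the "On the Web" section of the jsonlines.org site for tools that may help with what you need.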
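If the file really is one JSON object per line, a short Python script can turn it into a single JSON array, which the regular JSON importer can read. This is only a sketch under that assumption; `jsonl_to_array` is a name made up for this example, and the sample rows are shortened versions of the ones in the question.

```python
import json

def jsonl_to_array(lines):
    """Parse an iterable of JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Stand-in for the lines of a newline-delimited JSON file.
lines = [
    '{"id": "22", "about": "barFoo"}',
    '',
    '{"id": "3333", "about": "barbar"}',
]

records = jsonl_to_array(lines)

# One JSON document: an array that contains every object.
print(json.dumps(records, indent=2))
```

With a real file, an open file handle can be passed in place of `lines`, since iterating over a text file yields one line at a time.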
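Not sure whether you ever resolved this.
The JSON has to be valid to import successfully. At the moment, the text you posted in the question above does not validate with a tool such as http://jsonlint.com.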
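As an alternative to an online validator, the same kind of check can be done locally with Python's standard `json` module. This is a rough sketch, and `find_json_error` is a name made up for this example. The two failing inputs imitate problems visible in the posted text: a raw line break inside a string value, and objects separated by commas without an enclosing array.

```python
import json

def find_json_error(text):
    """Return None if `text` parses as JSON, otherwise a (line, column, message) tuple."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return (exc.lineno, exc.colno, exc.msg)
    return None

good = '[{"city": "Bridgewater"}]'
# "\n" here is a real line break inside the string value, which strict JSON forbids.
bad_newline = '{"city": "Bridgewater\n"}'
# Two objects separated by a comma, with no surrounding [ ].
bad_no_array = '{"name": "Febreze"},\n{"name": "The Hunger Games"}'

for label, text in [("good", good), ("newline", bad_newline), ("no array", bad_no_array)]:
    print(label, find_json_error(text))
```

The reported line and column give a starting point for finding the offending character in a large file.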
The issue you have importing this into OpenRefine (aka Google Refine) is that the JSON objects need to be in an array:
[{"name":"Carstar Bridgewater",
"category":"Automotive",
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.",
"country":"Canada",
"state":"NS",
"city":"Bridgewater"},
{"name":"Febreze",
"category":"Product/Service",
"about":"Freshness that eliminates odorsso you can breathe happy.",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings",
"category":"Professional Services",
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings",
"country":"Canada",
"state":"NS",
"city":"Middle Sackville"},
{"name":"The Hunger Games",
"category":"Movie",
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"}]
I can successfully import the JSON posted here into OpenRefine, and it works fine.