Remove duplicates from json array in a json file

I have a very large JSON file with thousands of lines that look like the ones below (shortened):

[
{"result": ["/results/1138/dundalk-aw/2022-03-11/806744", "/results/1138/dundalk-aw/2022-03-11/806745", "/results/1138/dundalk-aw/2022-03-11/806746", "/results/1138/dundalk-aw/2022-03-11/806747", "/results/1138/dundalk-aw/2022-03-11/806748", "/results/1138/dundalk-aw/2022-03-11/806749", "/results/1138/dundalk-aw/2022-03-11/806750", "/results/1138/dundalk-aw/2022-03-11/806751", "/results/14/exeter/2022-03-11/804190", "/results/14/exeter/2022-03-11/804193", "/results/14/exeter/2022-03-11/804194", "/results/14/exeter/2022-03-11/804192", "/results/14/exeter/2022-03-11/804196", "/results/14/exeter/2022-03-11/804191", "/results/14/exeter/2022-03-11/804195", "/results/30/leicester/2022-03-11/804201", "/results/30/leicester/2022-03-11/804200", "/results/30/leicester/2022-03-11/804198", "/results/30/leicester/2022-03-11/804197", "/results/30/leicester/2022-03-11/804199", "/results/30/leicester/2022-03-11/804202", "/results/37/newcastle/2022-03-11/804181", "/results/37/newcastle/2022-03-11/804179", "/results/37/newcastle/2022-03-11/804182", "/results/37/newcastle/2022-03-11/804180", "/results/37/newcastle/2022-03-11/804177", "/results/37/newcastle/2022-03-11/804176", "/results/37/newcastle/2022-03-11/804178", "/results/513/wolverhampton-aw/2022-03-11/804352", "/results/513/wolverhampton-aw/2022-03-11/804353", "/results/513/wolverhampton-aw/2022-03-11/806925", "/results/513/wolverhampton-aw/2022-03-11/804350", "/results/513/wolverhampton-aw/2022-03-11/804354", "/results/513/wolverhampton-aw/2022-03-11/804349", "/results/513/wolverhampton-aw/2022-03-11/804351", "/results/1303/al-ain/2022-03-11/806926", "/results/1244/goulburn/2022-03-11/807045", "/results/869/sakhir/2022-03-11/806948", "/results/1244/goulburn/2022-03-11/807045", "/results/869/sakhir/2022-03-11/806948"]},
{"result": ["/results/8/carlisle/2022-03-10/804174", "/results/8/carlisle/2022-03-10/804172", "/results/8/carlisle/2022-03-10/804170", "/results/8/carlisle/2022-03-10/804175", "/results/8/carlisle/2022-03-10/804171", "/results/8/carlisle/2022-03-10/804173", "/results/8/carlisle/2022-03-10/805620", "/results/1353/newcastle-aw/2022-03-10/804340", "/results/1353/newcastle-aw/2022-03-10/804341", "/results/1353/newcastle-aw/2022-03-10/804338", "/results/1353/newcastle-aw/2022-03-10/804342", "/results/1353/newcastle-aw/2022-03-10/804337", "/results/1353/newcastle-aw/2022-03-10/804339", "/results/394/southwell-aw/2022-03-10/804346", "/results/394/southwell-aw/2022-03-10/804344", "/results/394/southwell-aw/2022-03-10/804345", "/results/394/southwell-aw/2022-03-10/804348", "/results/394/southwell-aw/2022-03-10/806779", "/results/394/southwell-aw/2022-03-10/804343", "/results/394/southwell-aw/2022-03-10/804347", "/results/394/southwell-aw/2022-03-10/806778", "/results/198/thurles/2022-03-10/806623", "/results/198/thurles/2022-03-10/806624", "/results/198/thurles/2022-03-10/806625", "/results/198/thurles/2022-03-10/806626", "/results/198/thurles/2022-03-10/806627", "/results/198/thurles/2022-03-10/806628", "/results/198/thurles/2022-03-10/806629", "/results/90/wincanton/2022-03-10/804183", "/results/90/wincanton/2022-03-10/804186", "/results/90/wincanton/2022-03-10/804188", "/results/90/wincanton/2022-03-10/804185", "/results/90/wincanton/2022-03-10/804187", "/results/90/wincanton/2022-03-10/804184", "/results/90/wincanton/2022-03-10/804189", "/results/219/saint-cloud/2022-03-10/807032", "/results/219/saint-cloud/2022-03-10/806812", "/results/219/saint-cloud/2022-03-10/806837", "/results/219/saint-cloud/2022-03-10/807033", "/results/219/saint-cloud/2022-03-10/807037", "/results/219/saint-cloud/2022-03-10/807041", "/results/219/saint-cloud/2022-03-10/807042", "/results/219/saint-cloud/2022-03-10/807043", "/results/219/saint-cloud/2022-03-10/807044", "/results/219/saint-cloud/2022-03-10/806837", "/results/219/saint-cloud/2022-03-10/807033"]}
]

Now, there are some duplicates inside the "result" arrays, in this case for example /results/1244/goulburn/2022-03-11/807045.

How can I filter out these duplicates? I found some solutions on Stack Overflow for checking for duplicate "result" arrays, but not for checking whether anything inside an array is a duplicate. At least none that I managed to try, and I think I messed those up. I've been trying for two days but can't figure this out on my own, or I'm too dumb to find the answer in the similar questions on Stack Overflow, and my knowledge of Java is very, very limited.

I have looked into converting the JSON to a list and then filtering out the duplicates, but that seems clunky for such a large file?

You can easily use the set data type in Python: use a loop that goes over the dicts from the file and append (using update()) each one's entries to a set variable, one after another. Sadly I'm too lazy to write the code to solve your problem, but you can read more about sets and write the code yourself (W3Schools: How to append to a set).
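
A minimal sketch of that idea, assuming one set per "result" array; the filename data.json is my own placeholder, not something given in the question:

import json

# hypothetical filename for the large JSON file
with open('data.json') as f:
    data = json.load(f)

deduped = []
for entry in data:
    unique = set()
    # update() adds every URL from the list to the set, silently dropping repeats
    unique.update(entry['result'])
    deduped.append({'result': list(unique)})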

After loading all the JSON data, you can map over the results, use a set to remove the duplicated data, and convert back to a list to preserve the original structure:

# large JSON data list, e.g. loaded with json.load()
data = [{...}]
# rebuild each dict with its "result" list de-duplicated; note that set() does not preserve order
data = list(map(lambda x: {'result': list(set(x['result']))}, data))
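
If the order of the URLs matters, keep in mind that set() reorders the items. Below is a slightly fuller sketch that reads the file, de-duplicates each "result" array while keeping the first occurrence of every URL, and writes the cleaned data back out; the filenames data.json and data_deduped.json are assumptions, not from the question:

import json

# hypothetical input file holding the large JSON array
with open('data.json') as f:
    data = json.load(f)

# dict.fromkeys() keeps only the first occurrence of each URL and preserves order
data = [{'result': list(dict.fromkeys(entry['result']))} for entry in data]

# hypothetical output file for the de-duplicated data
with open('data_deduped.json', 'w') as f:
    json.dump(data, f, indent=1)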