Normalize JSON using Python
I am fairly new to JSON and Python, and for the past two days I have been struggling to flatten a JSON file.
I read the example at http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.io.json.json_normalize.html, but I didn't understand how to unlist some nested elements. I also read a few threads ("Flatten JSON based on an attribute - python" and "How to normalize complex nested json in python?") and https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10. I tried all of them without any luck.
Here is the first record of my JSON file:
d = {'city': {'url': 'link',
              'name': ['San Francisco']},
     'rank': 1,
     'resident': [
         {'link': ['bit.ly/0842/'], 'name': ['John A']},
         {'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
         {'link': ['bit.ly/2011/'], 'name': ['Cobb C']},
         {'link': ['bit.ly/0855/'], 'name': ['Jack N']},
         {'link': ['bit.ly/1430/'], 'name': ['Jack K']},
         {'link': ['bit.ly/3081/'], 'name': ['Edward']},
         {'link': ['bit.ly/2001/'], 'name': ['Jack W']},
         {'link': ['bit.ly/0020/'], 'name': ['Henry F']},
         {'link': ['bit.ly/2137/'], 'name': ['Joseph S']},
         {'link': ['bit.ly/3225/'], 'name': ['Ed B']},
         {'link': ['bit.ly/3667/'], 'name': ['George Vvec']},
         {'link': ['bit.ly/6434/'], 'name': ['Robert W']},
         {'link': ['bit.ly/4036/'], 'name': ['Rudy B']},
         {'link': ['bit.ly/6450/'], 'name': ['James K']},
         {'link': ['bit.ly/5180/'], 'name': ['Billy N']},
         {'link': ['bit.ly/7847/'], 'name': ['John S']}]
     }
Here is the expected output:
city_url city_name rank resident_link resident_name
link San Francisco 1 'bit.ly/0842/' 'John A'
link San Francisco 1 'bit.ly/5835/' 'Tedd B'
link San Francisco 1 'bit.ly/2011/' 'Cobb C'
link San Francisco 1 'bit.ly/0855/' 'Jack N'
link San Francisco 1 'bit.ly/1430/' 'Jack K'
link San Francisco 1 'bit.ly/3081/' 'Edward'
link San Francisco 1 'bit.ly/2001/' 'Jack W'
link San Francisco 1 'bit.ly/0020/' 'Henry F'
link San Francisco 1 'bit.ly/2137/' 'Joseph S'
link San Francisco 1 'bit.ly/3225/' 'Ed B'
link San Francisco 1 'bit.ly/3667/' 'George Vvec'
link San Francisco 1 'bit.ly/6434/' 'Robert W'
link San Francisco 1 'bit.ly/4036/' 'Rudy B'
link San Francisco 1 'bit.ly/6450/' 'James K'
link San Francisco 1 'bit.ly/5180/' 'Billy N'
link San Francisco 1 'bit.ly/7847/' 'John S'
The flatten_json() function (from the Medium.com article above) destroys the hierarchy. Here are the first few lines of its output:
{'city_url': 'link',
'city_name_0': 'San Francisco',
'rank': 1,
'resident_0_link_0': 'bit.ly/0842/',
'resident_0_name_0': 'John A', ...
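Can someone help me think through how to transform this kind of dataset? Unfortunately, the pandas documentation doesn't offer guidance for beginners. Here is what I have been playing with; none of it worked.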
from pandas.io.json import json_normalize
json_normalize(d,['city',['name','rank']])
json_normalize(d,['city','name','rank'])
json_normalize(d,['city','name'])
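I'd appreciate it if someone could explain how to do these kinds of transformations and the thought process behind them.
Also, because of the volume of data in the original dataset, I am looking for vectorized or O(N) operations rather than O(N^2). Anything slower than O(N) won't work.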
If you know the structure of the json blob, this will do:
import pandas

# each 'link' and 'name' is a single-element list, so take element 0
resident_link = [k['link'][0] for k in d['resident']]
resident_name = [k['name'][0] for k in d['resident']]
n = len(d['resident'])
# repeat the record-level values once per resident
city_url = n * [d['city']['url']]
city_name = n * [d['city']['name'][0]]
rank = n * [d['rank']]
df = pandas.DataFrame({
    'resident_name' : resident_name,
    'resident_link' : resident_link,
    'city_url' : city_url,
    'city_name' : city_name,
    'rank' : rank
})
which produces
city_name city_url rank resident_link resident_name
0 San Francisco link 1 bit.ly/0842/ John A
1 San Francisco link 1 bit.ly/5835/ Tedd B
2 San Francisco link 1 bit.ly/2011/ Cobb C
3 San Francisco link 1 bit.ly/0855/ Jack N
4 San Francisco link 1 bit.ly/1430/ Jack K
5 San Francisco link 1 bit.ly/3081/ Edward
6 San Francisco link 1 bit.ly/2001/ Jack W
7 San Francisco link 1 bit.ly/0020/ Henry F
8 San Francisco link 1 bit.ly/2137/ Joseph S
9 San Francisco link 1 bit.ly/3225/ Ed B
10 San Francisco link 1 bit.ly/3667/ George Vvec
11 San Francisco link 1 bit.ly/6434/ Robert W
12 San Francisco link 1 bit.ly/4036/ Rudy B
13 San Francisco link 1 bit.ly/6450/ James K
14 San Francisco link 1 bit.ly/5180/ Billy N
15 San Francisco link 1 bit.ly/7847/ John S
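For comparison, the same table can probably be obtained with json_normalize itself, using record_path for the nested list and meta for the fields to repeat on each row. The sketch below is only an illustration: it assumes pandas 1.0 or later (where json_normalize is available as pandas.json_normalize), it uses just two residents from the sample record, and unwrap is a hypothetical helper that strips the single-element lists before normalizing.

```python
import pandas as pd

# Two residents from the sample record, to keep the example short.
d = {
    "city": {"url": "link", "name": ["San Francisco"]},
    "rank": 1,
    "resident": [
        {"link": ["bit.ly/0842/"], "name": ["John A"]},
        {"link": ["bit.ly/5835/"], "name": ["Tedd B"]},
    ],
}


def unwrap(record):
    """Hypothetical helper: replace each single-element list with its only element."""
    return {
        "city": {"url": record["city"]["url"], "name": record["city"]["name"][0]},
        "rank": record["rank"],
        "resident": [
            {"link": r["link"][0], "name": r["name"][0]} for r in record["resident"]
        ],
    }


records = [unwrap(d)]  # a list, so further records can be appended

df = pd.json_normalize(
    records,
    record_path="resident",  # one output row per resident
    meta=[["city", "url"], ["city", "name"], "rank"],  # repeated on every row
    record_prefix="resident_",  # 'link' becomes 'resident_link'
    sep="_",  # ['city', 'url'] becomes 'city_url'
)
print(df)
```

The point of record_path is that it names the list whose items become rows, while meta lists the paths of the values that should be copied onto each of those rows.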
Edit
As the OP said in the comments, suppose there are many such records, each with the same structure:
nrecords = 10
dd = {k : d for k in range(nrecords)}
dd now holds 10 copies of the original json blob. This is how the code should be updated:
ff = pandas.DataFrame()
for record in range(nrecords):
    n = len(dd[record]['resident'])
    df = {
        'resident_link' : [k['link'][0] for k in dd[record]['resident']],
        'resident_name' : [k['name'][0] for k in dd[record]['resident']],
        'city_url' : n * [dd[record]['city']['url']],
        'city_name' : n * [dd[record]['city']['name'][0]],
        'rank' : n * [dd[record]['rank']]
    }
    df = pandas.DataFrame(df)
    # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
    ff = pandas.concat([ff, df]).reset_index(drop = True)
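Below is an estimate of the running time as a function of the number of records. Based on it, processing 1.5 million records would take about 1 hour.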
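The loop above rebuilds the accumulated frame on every iteration, so rows collected earlier are copied again and again. One way that may avoid this is to gather plain rows first and build the frame a single time at the end. This is a minimal sketch, assuming every record has the same structure as d; iter_rows is a hypothetical helper name, and only two residents from the sample record are used.

```python
def iter_rows(records):
    """Hypothetical helper: yield one flat dict per resident across all records."""
    for record in records:
        # record-level values, looked up once per record
        city_url = record["city"]["url"]
        city_name = record["city"]["name"][0]
        rank = record["rank"]
        for resident in record["resident"]:
            yield {
                "city_url": city_url,
                "city_name": city_name,
                "rank": rank,
                "resident_link": resident["link"][0],
                "resident_name": resident["name"][0],
            }


# Two residents from the sample record, to keep the example short.
d = {
    "city": {"url": "link", "name": ["San Francisco"]},
    "rank": 1,
    "resident": [
        {"link": ["bit.ly/0842/"], "name": ["John A"]},
        {"link": ["bit.ly/5835/"], "name": ["Tedd B"]},
    ],
}

nrecords = 10
dd = {k: d for k in range(nrecords)}

# Each resident is visited exactly once, so the work grows linearly
# with the total number of residents.
rows = list(iter_rows(dd.values()))
print(len(rows))  # 10 records x 2 residents = 20
```

Passing rows to pandas.DataFrame then builds the whole table in one step instead of growing it record by record.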