json_normalize (nested JSON) to CSV
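I have been trying to use pandas to extract data from a txt file that contains UTF-8 encoded JSON data.
Direct link to the data file - http://download.companieshouse.gov.uk/psc-snapshot-2022-02-06_8of20.zip
The data is structured as in the following example: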
{"company_number":"04732933","data":{"address":{"address_line_1":"Windsor Road","locality":"Torquay","postal_code":"TQ1 1ST","premises":"Windsor Villas","region":"Devon"},"country_of_residence":"England","date_of_birth":{"month":1,"year":1964},"etag":"5623f35e4bb5dc9cb37e134cb2ac0ca3151cd01f","kind":"individual-person-with-significant-control","links":{"self":"/company/04732933/persons-with-significant-control/individual/8X3LALP5gAh5dAYEOYimeiRiJMQ"},"name":"Ms Karen Mychals","name_elements":{"forename":"Karen","surname":"Mychals","title":"Ms"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"10118870","data":{"address":{"address_line_1":"Hilltop Road","address_line_2":"Bearpark","country":"England","locality":"Durham","postal_code":"DH7 7TL","premises":"54"},"ceased_on":"2019-04-15","country_of_residence":"England","date_of_birth":{"month":9,"year":1983},"etag":"5b3c984156794e5519851b7f1b22d1bbd2a5c5df","kind":"individual-person-with-significant-control","links":{"self":"/company/10118870/persons-with-significant-control/individual/hS6dYoZ234aXhmI6Q9y83QbAhSY"},"name":"Mr Patrick John Burns","name_elements":{"forename":"Patrick","middle_name":"John","surname":"Burns","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2017-04-06"}}
A plain pd.read_json did not work at first (I would get ValueError: Trailing data errors) until I passed lines=True (I am using Jupyter Notebook for this).
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
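For reference, a minimal sketch of why lines=True makes the difference: the file holds one JSON object per line rather than a single JSON document. The sample string below is a hypothetical, shortened stand-in for the real file, and the names sample, failed and df_sample are only illustrative.

```python
import io
import pandas as pd

# Two shortened records, one JSON object per line.
# (Hypothetical sample; the real file has many more fields per record.)
sample = (
    '{"company_number":"04732933","data":{"name":"Ms Karen Mychals"}}\n'
    '{"company_number":"10118870","data":{"name":"Mr Patrick John Burns"}}\n'
)

# Without lines=True the parser finds a second object after the first
# one and raises ValueError.
try:
    pd.read_json(io.StringIO(sample))
    failed = False
except ValueError:
    failed = True

# With lines=True every line is parsed as its own record.
df_sample = pd.read_json(io.StringIO(sample), lines=True)
print(failed)
print(df_sample.shape)
```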
This is how df.head() displays the data structure:
  company_number                                               data
0       06851805  {'address': {'address_line_1': 'Briar Road', '...
1       04732933  {'address': {'address_line_1': 'Windsor Road',...
2       10118870  {'address': {'address_line_1': 'Hilltop Road',...
3       10118870  {'address': {'address_line_1': 'Hilltop Road',...
4       09565353  {'address': {'address_line_1': 'Old Hertford R...
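After looking through Stack Overflow and several online tutorials, I tried pd.json_normalize(df), but I keep getting AttributeError: 'str' object has no attribute 'values'. Ultimately I want to export this JSON file to a CSV file.
Thanks in advance for any suggestions!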
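You can solve this by applying json_normalize to the data column only. json_normalize expects a dict or a list of dicts; iterating over a whole DataFrame yields its column labels, which are strings, and that is where the 'str' object has no attribute 'values' error comes from.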
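import pandas as pd
# Each line of the file is a separate JSON object, hence lines=True
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
# Flatten the nested dicts in the 'data' column into one column per field
df2 = pd.json_normalize(df['data'])
# Put the flattened columns next to the original ones
df = pd.concat([df, df2], axis=1)
# Output to CSV
df.to_csv("./OUTPUT_FILE_NAME")
print(df)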
company_number ... name_elements.middle_name
0 4732933 ... NaN
1 10118870 ... John
[2 rows x 24 columns]
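One thing to watch in the printed output: company_number shows up as 4732933 rather than 04732933, presumably because pandas inferred a numeric type for this small sample. A hedged sketch of one way to keep it as text, drop the raw data column once it has been flattened, and leave the row index out of the CSV follows. The in-memory sample is a hypothetical, shortened stand-in for the real file, and the names sample, flat, result and csv_text are only illustrative.

```python
import io
import pandas as pd

# Shortened stand-ins for two records (hypothetical subset of the fields).
sample = (
    '{"company_number":"04732933","data":{"name":"Ms Karen Mychals",'
    '"name_elements":{"forename":"Karen","surname":"Mychals"}}}\n'
    '{"company_number":"10118870","data":{"name":"Mr Patrick John Burns",'
    '"name_elements":{"forename":"Patrick","middle_name":"John","surname":"Burns"}}}\n'
)

# dtype keeps company_number as text, so "04732933" is not turned into 4732933.
df = pd.read_json(io.StringIO(sample), lines=True, dtype={"company_number": str})

# Flatten the nested dicts; nested keys become dotted column names.
flat = pd.json_normalize(df["data"].tolist())

# Swap the raw "data" column for the flattened columns.
result = pd.concat([df.drop(columns="data"), flat], axis=1)

# index=False leaves the row index out of the CSV text.
csv_text = result.to_csv(index=False)
print(csv_text)
```

With the real file, the same calls would take the file path in place of io.StringIO(sample) and an output path as the first argument of to_csv.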