Reshape a nested JSON file with jq and make a CSV
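I have been struggling with this all day: I want to convert the JSON below to CSV.
It is the officers list for company number "OC418979" from the UK Companies House API.
I have truncated the JSON so that "items" contains only 2 objects.
What I would like to get is a CSV like this: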
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
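There are 2 extra complications. First, there are 2 types of "officers": some are people and some are companies, so not every key present in one type is present in the other, and vice versa. I would like those missing entries to be 'null'. Second, some fields are awkward: "name" contains a comma, and the address is a nested object with several sub-fields (which I think I could flatten in pandas).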
{
"total_results": 13,
"resigned_count": 9,
"links": {
"self": "/company/OC418979/officers"
},
"items_per_page": 35,
"etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
"kind": "officer-list",
"active_count": 4,
"inactive_count": 0,
"start_index": 0,
"items": [
{
"officer_role": "llp-designated-member",
"name": "BARRICK, David James",
"date_of_birth": {
"year": 1984,
"month": 1
},
"appointed_on": "2017-09-15",
"country_of_residence": "England",
"address": {
"country": "United Kingdom",
"address_line_1": "Old Gloucester Street",
"locality": "London",
"premises": "27",
"postal_code": "WC1N 3AX"
},
"links": {
"officer": {
"appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
}
}
},
{
"links": {
"officer": {
"appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
}
},
"address": {
"locality": "Tadcaster",
"country": "United Kingdom",
"address_line_1": "Westgate",
"postal_code": "LS24 9AB",
"premises": "5a"
},
"identification": {
"legal_authority": "UK",
"identification_type": "non-eea",
"legal_form": "UK"
},
"name": "PREMIER DRIVER LIMITED",
"officer_role": "corporate-llp-designated-member",
"appointed_on": "2017-09-15"
}
]
}
What I have been doing is creating new JSON objects to extract the fields I need, like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
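But the query runs for hours, and I am sure there is a faster way.
Now I am trying a new approach: creating a JSON object whose root key is the company number, with its list of officers as the value.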
{(.links.self | split("/")[2]): .items[]}
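OK, you want to scan the list of officers, extract some fields from each one if they are present, and write them out in CSV format.
The first part is extracting the data from the JSON. Assuming you have loaded it into a Python object named data, you have: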
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
data['items'][0]['country_of_residence'])
which gives:
llp-designated-member 2017-09-15 England
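The `data` object above is assumed to come from parsing the API response with the standard `json` module. A minimal sketch follows, using a trimmed-down copy of the sample inline so that it runs on its own. With a real file you would call `json.load` on an open file handle instead.

```python
import json

# Trimmed copy of the sample response from the question;
# only the keys used below are kept.
raw = """
{
  "links": {"self": "/company/OC418979/officers"},
  "items": [
    {"officer_role": "llp-designated-member",
     "name": "BARRICK, David James",
     "appointed_on": "2017-09-15",
     "country_of_residence": "England"},
    {"name": "PREMIER DRIVER LIMITED",
     "officer_role": "corporate-llp-designated-member",
     "appointed_on": "2017-09-15"}
  ]
}
"""

# json.loads turns JSON objects into dicts and JSON arrays into lists.
data = json.loads(raw)

first = data['items'][0]
print(first['officer_role'], first['appointed_on'], first['country_of_residence'])
```

Note that the second item has no `country_of_residence` key, so indexing it directly with `['country_of_residence']` would raise a `KeyError`.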
Now it is time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        # Missing keys fall back to an empty string.
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')
                         ))
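The get method on dictionaries lets you supply a default value (here an empty string) when the key does not exist, and the csv module ensures that a field containing a comma is enclosed in quotes.
With your sample input, it gives: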
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
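The question also mentions a name that contains a comma and a nested address. As a rough sketch of how the same approach could cover both, the snippet below adds the name and two address sub-fields to each row. The helper `officer_row` and the choice of `ADDRESS_KEYS` are hypothetical and only for illustration. The output goes to an in-memory buffer rather than a file so the result can be inspected directly.

```python
import csv
import io

# Hypothetical choice of address sub-fields to flatten into columns.
ADDRESS_KEYS = ('locality', 'postal_code')

def officer_row(company, officer):
    """Build one CSV row; missing keys become empty strings."""
    address = officer.get('address', {})
    return [company,
            officer.get('name', ''),
            officer.get('country_of_residence', ''),
            *(address.get(key, '') for key in ADDRESS_KEYS)]

# Two officers reduced to the keys used here (values from the sample).
officers = [
    {"name": "BARRICK, David James", "country_of_residence": "England",
     "address": {"locality": "London", "postal_code": "WC1N 3AX"}},
    {"name": "PREMIER DRIVER LIMITED",
     "address": {"locality": "Tadcaster", "postal_code": "LS24 9AB"}},
]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer)
for officer in officers:
    writer.writerow(officer_row('OC418979', officer))

print(buffer.getvalue(), end='')
```

The first row comes out with the name quoted, as `"BARRICK, David James"`, so the embedded comma does not split the field. The second row has an empty `country_of_residence` column.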
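With jq, it is easier to extract the shared value from the top-level object once and then generate the required rows. You need to limit the number of passes over the items to at most one: each `.items[]` inside an object constructor is a separate generator, so jq emits every combination of their outputs, which is why the original query takes so long.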
$ jq -r '(.links.self | split("/")[2]) as $companyCode
| .items[]
| [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
| @csv
' input.json
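For comparison, the same single-pass idea can be sketched in Python: take the company code from `links.self` once, then visit each item exactly once. The function name `rows` and its `missing` parameter are hypothetical; `missing='null'` is only there because the question asks for 'null' in place of absent keys.

```python
import csv
import sys

FIELDS = ('country_of_residence', 'officer_role', 'appointed_on')

def rows(data, missing='null'):
    """Yield one row per officer, visiting data['items'] exactly once."""
    # "/company/OC418979/officers".split('/')
    #   -> ['', 'company', 'OC418979', 'officers'], so index 2 is the code.
    company_code = data['links']['self'].split('/')[2]
    for officer in data.get('items', []):
        yield [company_code,
               *(officer.get(field, missing) for field in FIELDS)]

# Sample reduced to the keys used here.
data = {
    "links": {"self": "/company/OC418979/officers"},
    "items": [
        {"officer_role": "llp-designated-member",
         "appointed_on": "2017-09-15",
         "country_of_residence": "England"},
        {"officer_role": "corporate-llp-designated-member",
         "appointed_on": "2017-09-15"},
    ],
}

csv.writer(sys.stdout).writerows(rows(data))
```

Because `rows` is a generator, it can be passed straight to `writerows` without building the whole list in memory first.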