Extend dataframe with rows containing dictionary in list
I have about 300,000 rows like the ones below, but all I need is the ID and the email address. The dataframe looks like this:
import pandas as pd
d = {'vid': [1201,1202], 'col2': [[{'vid': 1201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0,
'identities': [{'type': 'EMAIL', 'value': 'abc@gmaill.com', 'timestamp': 1548608578090, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': '69c4f6ec-e0e9-4632-8d16-cbc204a57b22', 'timestamp': 1548608578106}]},
{'vid': 314479851, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 183374504, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17543251, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 99700201, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 65375052, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 17525601, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []},
{'vid': 238128701, 'saved-at-timestamp': 1638824550030, 'deleted-changed-timestamp': 0, 'identities': []}],
[{'vid': 1202, 'saved-at-timestamp': 1548608578109, 'deleted-changed-timestamp': 0,
'identities': [{'type': 'EMAIL', 'value': 'xyz@gmaill.com', 'timestamp': 1548608578088, 'is-primary': True},
{'type': 'LEAD_GUID', 'value': 'fe6c2628-b1db-47c5-91f6-258e79ea58f0', 'timestamp': 1548608578106}]}]]}
df=pd.DataFrame(d)
df
vid col2
1201 [{'vid': 1201, 'saved-at-timestamp': 1638824550030........
1202 [{'vid': 1202, 'saved-at-timestamp': 1548608578109......
Expected output (only two fields, but for all rows):
vid email
1201 abc@gmaill.com
1202 xyz@gmaill.com
.. ..
Here is one approach using json_normalize:
out = (pd.concat(pd.json_normalize(lst, ['identities'], 'vid') for lst in d['col2'])
.pipe(lambda x: x[x['type']=='EMAIL'])[['vid','value']]
.rename(columns={'value':'email'}))
Or, if you only need the email, chain the str accessor repeatedly:
df=pd.DataFrame(d)
df['email'] = df['col2'].str[0].str.get('identities').str[0].str.get('value')
df = df.drop(columns='col2')
Output (of the first approach; with the str accessor version the index is 0, 1):
vid email
0 1201 abc@gmaill.com
0 1202 xyz@gmaill.com
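The str accessor chain above relies on the EMAIL identity being the first entry of the first dictionary in each list. If that is not guaranteed for all rows, a plain Python helper may be safer. The following is only a sketch: the helper name first_email, the sample values, and the choice to scan every dictionary in the row (not just the first) are assumptions, not part of the original answer.

```python
import pandas as pd

def first_email(records):
    """Return the first EMAIL identity value found in a list of dicts, or None."""
    for record in records:
        for identity in record.get("identities", []):
            if identity.get("type") == "EMAIL":
                return identity.get("value")
    return None

# Made-up sample in the same shape as the question's data.
sample = {
    "vid": [1, 2, 3],
    "col2": [
        # EMAIL is not the first identity here
        [{"vid": 1, "identities": [{"type": "LEAD_GUID", "value": "g-1"},
                                   {"type": "EMAIL", "value": "a@example.com"}]}],
        # EMAIL sits in the second dictionary of the list
        [{"vid": 2, "identities": []},
         {"vid": 20, "identities": [{"type": "EMAIL", "value": "b@example.com"}]}],
        # no EMAIL at all
        [{"vid": 3, "identities": []}],
    ],
}
sample_df = pd.DataFrame(sample)
sample_df["email"] = sample_df["col2"].map(first_email)
result = sample_df.drop(columns="col2")
print(result)
```

Rows without any EMAIL identity end up with a missing value instead of raising, so every original row is kept.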
You can use pd.json_normalize:
df = pd.json_normalize([sub for item in d['col2'] for sub in item], record_path='identities', meta='vid')
Output:
>>> df
type value timestamp is-primary vid
0 EMAIL abc@gmaill.com 1548608578090 True 1201
1 LEAD_GUID 69c4f6ec-e0e9-4632-8d16-cbc204a57b22 1548608578106 NaN 1201
2 EMAIL xyz@gmaill.com 1548608578088 True 1202
3 LEAD_GUID fe6c2628-b1db-47c5-91f6-258e79ea58f0 1548608578106 NaN 1202
Now just use .loc to get the data you want:
df = df.loc[df['type'] == 'EMAIL', ['vid', 'value']]
Output:
>>> df
vid value
0 1201 abc@gmaill.com
2 1202 xyz@gmaill.com
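The result above still has a column named value and a gapped index (0, 2), while the expected output in the question uses email. A possible follow-up, sketched here on a shortened stand-in for the question's data (the records list and the placeholder GUID are assumptions), is to rename the column and reset the index:

```python
import pandas as pd

# Shortened stand-in for the flattened d['col2'] from the question.
records = [
    {"vid": 1201, "identities": [{"type": "EMAIL", "value": "abc@gmaill.com"},
                                 {"type": "LEAD_GUID", "value": "guid-1"}]},
    {"vid": 314479851, "identities": []},  # contributes no rows
    {"vid": 1202, "identities": [{"type": "EMAIL", "value": "xyz@gmaill.com"}]},
]

# One row per identity, with the parent vid attached as metadata.
flat = pd.json_normalize(records, record_path="identities", meta="vid")

emails = (
    flat.loc[flat["type"] == "EMAIL", ["vid", "value"]]
        .rename(columns={"value": "email"})  # match the expected column name
        .reset_index(drop=True)              # replace the gapped index with 0..n-1
)
print(emails)
```

Records with an empty identities list produce no rows in flat, so they simply drop out of the result.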
Alternatively, instead of using .loc, you can pivot the dataframe after json_normalize:
df = df.pivot(index='vid', columns='type', values='value').rename_axis(None, axis=1).reset_index()
Output:
>>> df
vid EMAIL LEAD_GUID
0 1201 abc@gmaill.com 69c4f6ec-e0e9-4632-8d16-cbc204a57b22
1 1202 xyz@gmaill.com fe6c2628-b1db-47c5-91f6-258e79ea58f0
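One caveat with pivot: it raises a ValueError when the same vid has more than one identity of the same type, which may happen somewhere in 300,000 rows. A possible workaround, shown here as a sketch on made-up data, is pivot_table with aggfunc='first'. Keeping only the first value per (vid, type) pair is an assumption about what you want, not something the question states.

```python
import pandas as pd

# Made-up flattened data where vid 1 has two EMAIL identities.
flat = pd.DataFrame({
    "vid": [1, 1, 1, 2],
    "type": ["EMAIL", "EMAIL", "LEAD_GUID", "EMAIL"],
    "value": ["a@example.com", "a2@example.com", "g-1", "b@example.com"],
})

# pivot() cannot reshape duplicate (vid, type) pairs.
try:
    flat.pivot(index="vid", columns="type", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates; 'first' keeps the first value seen.
wide = (
    flat.pivot_table(index="vid", columns="type", values="value", aggfunc="first")
        .rename_axis(None, axis=1)
        .reset_index()
)
print(wide)
```

A vid that lacks one of the identity types gets a missing value in that column.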