How to extract data from a list of dicts, into a pandas dataframe?
This is part of a JSON file that I got as output after running a Python script that uses the Telegram API.
[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}
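As you can see, the Python script scraped the chat history from a particular Telegram channel. I only need to store the date and message parts of the JSON in a separate dataframe, so that I can apply appropriate filters and produce the proper output. Can anyone help me with this?
I think you should use json.loads and then json_normalize to convert the JSON into a dataframe, with max_level to control how deeply nested dictionaries are flattened.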
from pandas import json_normalize
import json
d = '[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}]'
f = json.loads(d)
print(json_normalize(f, max_level=2))
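Since the question only asks for the date and message fields, the flattened frame can be narrowed to those two columns afterwards. The following is a minimal sketch of that step; the `raw` string is an abbreviated stand-in for the output above (most keys omitted), and the names `raw`, `records`, `flat` and `subset` are illustrative only.

```python
import json
from pandas import json_normalize

# Abbreviated stand-in for the API output above (most keys omitted for brevity).
raw = '''[
  {"_": "Message", "id": 4589,
   "to_id": {"_": "PeerChannel", "channel_id": 1399858792},
   "date": "2020-09-03T14:51:03+00:00",
   "message": "Looking for product managers"},
  {"_": "MessageService", "id": 4588,
   "to_id": {"_": "PeerChannel", "channel_id": 1399858792},
   "date": "2020-09-03T11:48:18+00:00",
   "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}}
]'''

records = json.loads(raw)
flat = json_normalize(records, max_level=2)

# Nested keys become dotted column names such as 'to_id.channel_id'.
# Keep only the two columns of interest; a record with no 'message' key
# ends up with a missing value in that column.
subset = flat[["date", "message"]]
print(subset)
```

Selecting the columns after normalizing is convenient when the other fields might be needed later; if they never are, building the frame from just the two keys avoids flattening everything first.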
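- This assumes the object returned from the API is not a string (e.g. '[{...}, {...}]').
  - If it is a string, first use data = json.loads(data).
- The 'date' and corresponding 'message' can be extracted from the list of dicts with a list comprehension.
  - Iterate through each dict in the list and use dict.get for each key. If the key doesn't exist, None is returned.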
import pandas as pd
# where data is the list of dicts, unpack the desired keys and load into pandas
df = pd.DataFrame([{'date': i.get('date'), 'message': i.get('message')} for i in data])
# display(df)
date message
0 2020-09-03T14:51:03+00:00 Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
1 2020-09-03T11:48:18+00:00 None
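The two cases above (a JSON string versus an already-parsed list) can be folded into one small function. This is only a sketch under stated assumptions: `to_frame` is a hypothetical helper name, not part of pandas, and `sample` is a shortened stand-in for the real output.

```python
import json
import pandas as pd

def to_frame(data, keys=("date", "message")):
    """Hypothetical helper: accept a JSON string or a list of dicts."""
    # Parse first if the API handed back a string rather than a list.
    if isinstance(data, str):
        data = json.loads(data)
    # dict.get returns None for keys a record does not have.
    return pd.DataFrame([{k: d.get(k) for k in keys} for d in data])

# Shortened stand-in for the real output: the second record has no 'message'.
sample = (
    '[{"date": "2020-09-03T14:51:03+00:00", "message": "hello"},'
    ' {"date": "2020-09-03T11:48:18+00:00"}]'
)

df_from_str = to_frame(sample)               # string input
df_from_list = to_frame(json.loads(sample))  # list input
print(df_from_str)
```

Both calls should yield the same two-column frame, with a missing value in the message column for the second record.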
Or
- If you want to skip the records where 'message' is None
df = pd.DataFrame([{'date': i['date'], 'message': i['message']} for i in data if i.get('message')])
date message
2020-09-03T14:51:03+00:00 Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
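Because the goal in the question is to apply filters afterwards, it may help to convert the date strings to real timestamps first. A minimal sketch, assuming a frame shaped like the ones built above; the two rows, the noon cutoff and the keyword "product" are made up for illustration.

```python
import pandas as pd

# Two hand-written rows mirroring the shape of the frame built above.
df = pd.DataFrame({
    "date": ["2020-09-03T14:51:03+00:00", "2020-09-03T11:48:18+00:00"],
    "message": ["Looking for product managers", None],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) timestamps.
df["date"] = pd.to_datetime(df["date"], utc=True)

# Example filter 1: rows dated after an arbitrary cutoff (noon UTC here).
cutoff = pd.Timestamp("2020-09-03T12:00:00", tz="UTC")
after_noon = df[df["date"] > cutoff]

# Example filter 2: rows whose message contains a keyword.
# na=False treats missing messages as non-matching instead of raising.
with_keyword = df[df["message"].str.contains("product", na=False)]

print(after_noon)
print(with_keyword)
```

Once the column holds timestamps, comparisons and the `.dt` accessor work directly on it, which is usually easier than comparing the raw strings.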