Pandas json_normalize list of dictionaries into specified columns
I have a dataframe with a column of lists of dictionaries that looks like
[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}]
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}]
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
I expected json_normalize to put the data into columns like
first_open_time.int_value, first_open_time.set_timestamp_micros, User_dedication.string_value, User_dedication.set_timestamp_micros, etc.
Instead, it just split it into 7 columns of dictionaries:
{'first_open_time.int_value': '1652796000000', 'first_open_time.set_timestamp_micros': '1652792823456000'} {'User_dedication.string_value': '1', 'User_dedication.set_timestamp_micros': '1653137417352000'} {'User_activity.string_value': '1', 'User_activity.set_timestamp_micros': '1653136561498000'}
This looks almost like what I need, but the values are still dictionaries. Some rows are also out of order, like the first example.
I tried specifying meta (as I understood from some manuals):
df3 = pd.json_normalize(df3,
    meta=[['first_open_time', 'int_value'], ['first_open_time', 'set_timestamp_micros'],
          ['User_dedication', 'string_value'], ['User_dedication', 'set_timestamp_micros'],
          ['User_activity', 'string_value'], ['User_activity', 'set_timestamp_micros'],
          ['Minutes_in_app', 'string_value'], ['Minutes_in_app', 'set_timestamp_micros'],
          ['ga_session_number', 'int_value'], ['ga_session_number', 'set_timestamp_micros'],
          ['Paying_user', 'string_value'], ['Paying_user', 'set_timestamp_micros'],
          ['ga_session_id', 'int_value'], ['ga_session_id', 'set_timestamp_micros']])
But it gives AttributeError: 'list' object has no attribute 'values'.
Maybe part of the problem is that the dictionaries in some rows are out of order, and some rows have fewer dictionaries. This is just how BigQuery stores the events.
Is there any way to solve this? Maybe by sorting the dictionaries in every row so they are all in the same order, or by specifying each column and which value should go into it?
json_normalize can be applied to each element of the raw data. However, without flattening the returned df you will get a lot of NaN values.
You can keep a loop and concatenate all the flattened rows:
import pandas as pd

data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
]
df = pd.DataFrame()
for d in data:
    # Normalize one row: gives one line per inner dict, mostly NaN.
    df_tmp = pd.json_normalize(d)
    # Flatten the values row-major and drop the NaNs, leaving one row
    # whose values line up with df_tmp's column order.
    row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
    row.columns = df_tmp.columns
    df = pd.concat([df, row])
print(df.reset_index(drop=True))
Output:
first_open_time.int_value first_open_time.set_timestamp_micros User_dedication.string_value ... Paying_user.set_timestamp_micros ga_session_id.int_value ga_session_id.set_timestamp_micros
0 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
1 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
2 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
3 1653195600000 1653193960416000 NaN ... NaN NaN NaN
4 1653102000000 1653098744032000 1 ... NaN 1653165977 1653165977727000
[5 rows x 14 columns]
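An alternative worth considering (a sketch not taken from the answer above): since every inner dict has exactly one key, you can merge each row's dicts into a single flat dict with dotted keys yourself, then build the DataFrame in one call. Missing keys simply come out as NaN, and the per-row ordering no longer matters. `flatten_row` is a hypothetical helper name; the sample `data` below is an abbreviated version of the question's rows.

```python
import pandas as pd

# Abbreviated sample in the same shape as the question's data.
data = [
    [{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}},
     {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}],
    [{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
]

def flatten_row(dicts):
    """Merge a row's single-key dicts into one flat dict with dotted keys."""
    flat = {}
    for d in dicts:
        for key, inner in d.items():
            for subkey, value in inner.items():
                flat[f"{key}.{subkey}"] = value
    return flat

df = pd.DataFrame([flatten_row(row) for row in data])
print(df)
```

This avoids building and flattening an intermediate NaN-filled frame per row, and the column order follows the first appearance of each key across all rows.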