Flatten pandas dataframe column containing list of dictionaries
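I am flattening a dataframe in which one column contains a list of dictionaries. I have written code for it, but it takes about 25 seconds to process just 5,000 rows, which is far too slow.
Here is the sample dataset: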
event_date timestamp event_name user_properties
20191117 1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117 1.57401E+15 screen_view [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117 1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117 1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117 1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
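Here is the parsed dataframe:
The result contains each 'key' as a column. In addition, if the dictionary has a 'set_timestamp_micros' key, that value goes into a column named {key}.set_timestamp_micros.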
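To make the column naming concrete, here is a minimal sketch of the expected output for a single row. It is not the asker's code: `flatten_properties` is a hypothetical helper name, and it assumes each entry carries exactly one non-None value besides its timestamp.

```python
def flatten_properties(properties):
    """Turn one list of {'key': ..., 'value': {...}} dicts into a flat dict."""
    flat = {}
    for prop in properties:
        name, value = prop['key'], prop['value']
        # The first non-None entry that is not a timestamp becomes the '{key}' column
        flat[name] = next(
            (v for k, v in value.items() if v is not None and 'timestamp' not in k),
            None,
        )
        # Timestamp entries are stored under '{key}.{timestamp_key}'
        for k, v in value.items():
            if 'timestamp' in k:
                flat[f"{name}.{k}"] = v
    return flat


# Two entries taken from the sample data above
row = [
    {'key': 'ga_session_id',
     'value': {'string_value': None, 'int_value': 1574005142,
               'float_value': None, 'double_value': None,
               'set_timestamp_micros': 1574005142713000}},
    {'key': 'ga_session_number',
     'value': {'string_value': None, 'int_value': 5,
               'float_value': None, 'double_value': None,
               'set_timestamp_micros': 1574005142713000}},
]
flat_row = flatten_properties(row)
print(flat_row)
```

Each key of the resulting dictionary corresponds to one column of the desired dataframe.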
Here is the code that flattens the dataframe:
def normalize_complex_column_v2(df, df_copy, column):
    col_list = []
    for index,row in df.iterrows():
        for element in row[column]:
            cols = [element['key']]
            cols += ["%s.%s"%(element['key'],key) for key in element['value'].keys() if 'timestamp' in key]
            df_copy= df_copy.reindex(columns=list(dict.fromkeys(df_copy.columns.tolist() + cols)))
            df_copy.loc[index,cols] = [value for key,value in element['value'].items() if value is not None]
    df_copy.drop([column], axis=1, inplace=True)
    return df_copy
How can I optimize this code?
Update:
Is there any way to use swifter to optimize my function?
Numba issue:
<ipython-input-101-15265d3af7fb>:1: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "flatten_dataframe_column" failed type inference due to: Untyped global name 'defaultdict': cannot determine Numba type of <class 'type'>
File "<ipython-input-101-15265d3af7fb>", line 4:
def flatten_dataframe_column(df,column,fetch_timestamp=True):
<source elided>
temp_dict = df[column].to_dict()
new_dict = defaultdict(dict)
LoweringError: Failed in object mode pipeline (step: object mode backend)
.3.182
File "<ipython-input-101-15265d3af7fb>", line 16:
def flatten_dataframe_column(df,column,fetch_timestamp=True):
<source elided>
elements['key'] : [value for key,value in elements['value'].items() \
if (value is not None and 'timestamp' not in key)][0]
^
[1] During: lowering ".3.182 = unary(fn=<built-in function not_>, value=.3.182)" at <ipython-input-101-15265d3af7fb> (16)
-------------------------------------------------------------------------------
This should not have happened, a problem has occurred in Numba's internals.
You are currently using Numba version 0.47.0.
Please report the error message and traceback, along with a minimal reproducer
at: https://github.com/numba/numba/issues/new
If more help is needed please feel free to speak to the Numba core developers
directly at: https://gitter.im/numba/numba
Thanks in advance for your help in improving Numba!
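For starters, you can iterate over the values in the column instead of over the whole dataframe, which frees up some memory. Second, your list comprehensions are "loopy". Third, copying the dataframe and dropping columns from it is computationally inefficient.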
def normalize_complex_column_v2(df, df_copy, column):
    # df_copy is unused here; it is kept only so the signature matches the question
    for i, element in enumerate(df[column].values):
        # Get the first dictionary in the list; skip rows that do not hold a non-empty list
        if not isinstance(element, list) or not element:
            continue
        element = element[0]
        # Collect only the timestamp keys of the nested 'value' dictionary
        cols = [key for key in element['value'] if 'timestamp' in key]
        # Write the timestamp values; .iloc cannot take column names, so use .loc with labels
        for key in cols:
            df.loc[df.index[i], "%s.%s" % (element['key'], key)] = element['value'][key]
        # Now create the column for `element['key']`
        df.loc[df.index[i], element['key']] = 'foo'  # Some value for this, not sure where you're pulling from...
    return df
This should do the trick. Let me know how it compares!
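To compare variants on equal footing, a small timing harness along these lines may help. This is only a sketch: `make_sample`, `time_call`, and `count_keys` are hypothetical names, the synthetic frame simply repeats two entries from the sample data, and `count_keys` is a stand-in workload to be replaced by the flattening function under test.

```python
import time

import pandas as pd


def make_sample(n_rows):
    """Build a synthetic frame shaped like the sample data."""
    properties = [
        {'key': 'ga_session_id',
         'value': {'string_value': None, 'int_value': 1574005142,
                   'float_value': None, 'double_value': None,
                   'set_timestamp_micros': 1574005142713000}},
        {'key': 'ga_session_number',
         'value': {'string_value': None, 'int_value': 5,
                   'float_value': None, 'double_value': None,
                   'set_timestamp_micros': 1574005142713000}},
    ]
    return pd.DataFrame({
        'event_name': ['user_engagement'] * n_rows,
        'user_properties': [properties] * n_rows,
    })


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def count_keys(df, column):
    # Stand-in workload: swap in the flattening function you want to measure
    return sum(len(props) for props in df[column])


sample = make_sample(5000)
result, elapsed = time_call(count_keys, sample, 'user_properties')
print(f"{elapsed:.4f} s for {len(sample)} rows")
```

A single call gives only a rough figure; repeating the measurement a few times and taking the best run is usually more stable.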
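I converted the dataframe column to a dictionary and processed the data there. I then converted the processed dictionary back to a dataframe and joined it with the original dataframe on the index.
Processing 500,000 records takes about 8 seconds.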
from collections import defaultdict

import pandas as pd

def flatten_dataframe_column(df,column):
    # Map each row index to its list of property dictionaries
    temp_dict = df[column].to_dict()
    new_dict = defaultdict(dict)
    for item in temp_dict.items():
        for elements in item[1]:
            # '{key}.set_timestamp_micros' column
            new_dict[item[0]].update(
                {
                    (elements['key']+'.set_timestamp_micros') : elements['value']['set_timestamp_micros']
                }
            )
            # '{key}' column: the first non-None, non-timestamp value
            new_dict[item[0]].update(
                {
                    elements['key'] : [value for key,value in elements['value'].items() \
                                       if (value is not None and 'timestamp' not in key)][0]
                }
            )
    return pd.DataFrame.from_dict(new_dict,orient='index')
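The function returns only the flattened columns; the join back onto the original dataframe is not shown. A minimal sketch of that step, assuming the index is unique, could look like this. The function is repeated in compact form so the block runs on its own, and the two-row frame is a cut-down version of the sample data.

```python
from collections import defaultdict

import pandas as pd


def flatten_dataframe_column(df, column):
    # Compact restatement of the function above
    new_dict = defaultdict(dict)
    for idx, elements in df[column].to_dict().items():
        for element in elements:
            value = element['value']
            new_dict[idx][element['key'] + '.set_timestamp_micros'] = value['set_timestamp_micros']
            new_dict[idx][element['key']] = [
                v for k, v in value.items()
                if v is not None and 'timestamp' not in k
            ][0]
    return pd.DataFrame.from_dict(new_dict, orient='index')


properties = [
    {'key': 'ga_session_id',
     'value': {'string_value': None, 'int_value': 1574005142,
               'float_value': None, 'double_value': None,
               'set_timestamp_micros': 1574005142713000}},
    {'key': 'ga_session_number',
     'value': {'string_value': None, 'int_value': 5,
               'float_value': None, 'double_value': None,
               'set_timestamp_micros': 1574005142713000}},
]
df = pd.DataFrame({
    'event_name': ['user_engagement', 'screen_view'],
    'user_properties': [properties, properties],
})

flat = flatten_dataframe_column(df, 'user_properties')
# Drop the nested column and attach the flattened columns by index
result = df.drop(columns=['user_properties']).join(flat)
print(result.columns.tolist())
```

`DataFrame.join` aligns on the index by default, so the rows line up as long as the flattened frame keeps the original index labels.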
If anyone can think of a more efficient solution, please post it.