Pandas data reduction and merging

I'm working with a Pandas (version 0.17.1) DataFrame that looks like this:

                         time   type   module     msg_type         content
36636 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36637 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36638 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36639 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36715 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36716 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36717 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36718 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36719 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36720 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36721 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36722 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36723 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36724 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36725 2016-08-25 17:59:50.964  ERROR   MOD_2_NAME  STATUS  Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36726 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36727 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l
36785 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36786 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36787 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36788 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36827 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36828 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36829 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36830 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36831 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36832 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36833 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36834 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36835 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36836 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36837 2016-08-25 19:01:50.964  ERROR   MOD_2_NAME  STATUS  Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36838 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36839 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l

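(The frame has already been reduced to remove uninteresting rows, which is why there are gaps in the index column.)

As you can see, multiple parameters are read from a device at the same time, and each reading is a separate row. I would like to do some "reduction" or "compression" so that there is only one row per reading. I would also like the content column to become a dictionary, so that I can easily look up specific items of interest. The result would look something like this: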

                         time   type   module     msg_type         content
36636 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36715 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}
36785 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36827 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}

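So basically I want all rows that have the same values in the time and module columns to be "merged" together, with their content columns parsed into a dictionary. (There may also be some "missing" or "empty" readings.) I don't want to filter or drop data, just reduce and summarize it.

I'm guessing I need some combination of groupby(), transform(), and apply(), but I'm not sure where to start.

Part of my difficulty is that I can't inspect the result of groupby() to see whether it is doing what I want.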

g1 = df.groupby(['module', 'time'])

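g1 doesn't appear in the Spyder variable explorer. Printing it shows nothing useful. I can't access an index attribute or call info() on g1. And I'm not sure groupby() is even the right tool here... I don't want to eliminate anything.

I've been searching for examples but keep getting what look like false hits. Any help getting started would be appreciated.

Define a function, then use groupby() followed by apply():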

In [234]: import re

In [235]: def create_data_dict(rows):
     ...:     return {k:v for k,v in re.findall(r"'([^']*)' = ([^ ]*)", ' '.join(rows.content.astype(str)))}
     ...: 

In [236]: df[df['type'] != 'ERROR'].groupby(['time', 'module', 'msg_type']).apply(create_data_dict).to_frame(name = 'content').reset_index()
Out[236]: 
                      time      module msg_type                                                                                                                                                                                                                                                                                                                                                                                                          content
0  2016-08-25 17:59:50.051  MOD_1_NAME   STATUS                                                                                                                                                                                                                                                                                 {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
1  2016-08-25 17:59:50.964  MOD_2_NAME   STATUS  {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
2  2016-08-25 18:59:50.051  MOD_1_NAME   STATUS                                                                                                                                                                                                                                                                                 {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
3  2016-08-25 19:01:50.964  MOD_2_NAME   STATUS  {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
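Note that the snippet above filters out the ERROR rows, so 'Parameter 11' is absent from the result, while the question asks for nothing to be dropped. The following is only a sketch of one way to keep such readings as empty strings. The sample frame, the regular expression, and the helper name create_data_dict_keep_missing are assumptions made for illustration, not part of the original answer. The groups are iterated explicitly rather than passed through apply().

```python
import re

import pandas as pd

# Small hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    'time': ['2016-08-25 17:59:50.964'] * 3,
    'type': ['INFO', 'ERROR', 'INFO'],
    'module': ['MOD_2_NAME'] * 3,
    'msg_type': ['STATUS'] * 3,
    'content': [
        "Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j",
        "Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!",
        "Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k",
    ],
})

# A quoted name that starts after whitespace (so the apostrophe in "Didn't"
# is not mistaken for an opening quote), optionally followed by " = value".
PATTERN = re.compile(r"(?:^|\s)'([^']*)'(?: = (\S+))?")

def create_data_dict_keep_missing(contents):
    """Map each quoted name to its value, or to '' when no value was logged."""
    result = {}
    for line in contents:
        m = PATTERN.search(line)
        if m:
            result[m.group(1)] = m.group(2) or ''
    return result

# Build one output row per (time, module, msg_type) group.
rows = [
    {'time': t, 'module': mod, 'msg_type': mt,
     'content': create_data_dict_keep_missing(grp['content'])}
    for (t, mod, mt), grp in df.groupby(['time', 'module', 'msg_type'])
]
merged = pd.DataFrame(rows)
print(merged)
```

Because the ERROR line carries no " = value" part, the optional group stays unmatched and the name is stored with an empty string.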

To learn about groups in pandas, have a look at http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-object-attributes. Another way to get insight into the groups is simply to print them:

grouped = df.groupby(['A', 'B'])
print(grouped.first())  # prints the first row of each group

# print each (name, group) tuple from grouped
for name, grp in grouped:
    print(name)
    print(grp)
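As for why g1 seems to show nothing: a GroupBy object only describes how the rows are split, and no rows are removed by creating it. A few of its attributes make the split visible. The sketch below uses a hypothetical three-row miniature of the question's frame; the shortened time strings and content values are placeholders.

```python
import pandas as pd

# Hypothetical miniature of the question's frame.
df = pd.DataFrame({
    'time': ['17:59:50.051', '17:59:50.051', '17:59:50.964'],
    'module': ['MOD_1_NAME', 'MOD_1_NAME', 'MOD_2_NAME'],
    'content': ['a', 'b', 'c'],
}, index=[36636, 36637, 36715])

g1 = df.groupby(['module', 'time'])

print(g1.ngroups)   # number of distinct (module, time) groups
print(g1.size())    # number of rows in each group
print(g1.groups)    # mapping of group key -> row labels in that group
print(g1.get_group(('MOD_1_NAME', '17:59:50.051')))  # the rows of one group
```

Every original row belongs to exactly one group, so the group sizes add up to len(df).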

I've worked out a specific solution for you, based on some assumptions I made (see the notes below):

import re
from collections import OrderedDict

import pandas as pd

df = pd.read_csv('/Users/shawnheide/Desktop/test.csv')

def custom_agg(contents):
    # Build an ordered {name: value} mapping from one group's content strings.
    this_dict = OrderedDict()
    for content in contents:
        # The key is the property/parameter name, e.g. "Property A" or "Parameter 11".
        match = re.findall(r"Property \w+|Parameter \d+", content)
        if match:
            key = match[0]
            # ERROR lines carry no value; store an empty string for them.
            match = re.findall(r"some_value_\w+", content)
            value = match[0] if match else ''
            this_dict[key] = value
    return this_dict

grps = df.groupby(['time', 'module'], as_index=False)
df_grp = grps.agg({'content': custom_agg})

Output:

time    module  content
0   2016-08-25 17:59:50.051 MOD_1_NAME  {'Property A': 'some_value_1', 'Property B': 'some_value_2', 'Property C': 'some_value_3', 'Property D': 'some_value_4'}
1   2016-08-25 17:59:50.964 MOD_2_NAME  {'Parameter 1': 'some_value_a', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 11': '', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}
2   2016-08-25 18:59:50.051 MOD_1_NAME  {'Property A': 'some_value_1', 'Property B': 'some_value_2', 'Property C': 'some_value_3', 'Property D': 'some_value_4'}
3   2016-08-25 19:01:50.964 MOD_2_NAME  {'Parameter 1': 'some_value_a', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 11': '', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}

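Things to consider:

First, you should post your data in a format that others can read directly (e.g. csv, tsv); that makes it easier for others to import it and help you solve the problem.

Second, your proposed result includes the index and the msg_type column. Given that you aren't grouping on those columns, that doesn't really make sense, but it's just something to think about.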

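Finally, to get an ordered dictionary you need to use OrderedDict from the collections module, because Python dictionaries do not preserve order (hopefully this feature arrives in 3.6).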

pv = df.set_index(['time', 'type', 'module', 'msg_type']) \
       .content.str.extract(r"'(?P<prop>.+)' = (?P<val>.+)", expand=True)

pv.groupby(level=[0, 2]).apply(lambda df: df.set_index('prop').val.to_dict())

2016-08-25 17:59:50.051,MOD_1_NAME,"{'Property A': 'some_value_1', 'Property C': 'some_value_3', 'Property B': 'some_value_2', 'Property D': 'some_value_4'}"
2016-08-25 17:59:50.964,MOD_2_NAME,"{'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 1': 'some_value_a', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}"
2016-08-25 18:59:50.051,MOD_1_NAME,"{'Property A': 'some_value_1', 'Property C': 'some_value_3', 'Property B': 'some_value_2', 'Property D': 'some_value_4'}"
2016-08-25 19:01:50.964,MOD_2_NAME,"{'Parameter 6': 'some_value_f', 'Parameter 7': 'some_value_g', 'Parameter 4': 'some_value_d', 'Parameter 5': 'some_value_e', 'Parameter 2': 'some_value_b', 'Parameter 3': 'some_value_c', 'Parameter 1': 'some_value_a', 'Parameter 8': 'some_value_h', 'Parameter 9': 'some_value_i', 'Parameter 10': 'some_value_j', 'Parameter 12': 'some_value_k', 'Parameter 13': 'some_value_l'}"
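Once content holds dictionaries, looking up a specific item (the stated goal in the question) can be done per row with dict.get. This is a minimal sketch on a hypothetical two-row reduced frame; the frame contents and the new column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical reduced frame, shaped like the outputs above.
reduced = pd.DataFrame({
    'time': ['2016-08-25 17:59:50.051', '2016-08-25 17:59:50.964'],
    'module': ['MOD_1_NAME', 'MOD_2_NAME'],
})
reduced['content'] = [
    {'Property A': 'some_value_1', 'Property B': 'some_value_2'},
    {'Parameter 1': 'some_value_a', 'Parameter 11': ''},
]

# Pull one item out of each row's dictionary; rows without the key get None.
reduced['Property A'] = reduced['content'].apply(lambda d: d.get('Property A'))

# Keep only the rows that actually reported that item.
has_prop_a = reduced[reduced['Property A'].notnull()]
print(has_prop_a[['time', 'module', 'Property A']])
```

Using d.get instead of d['Property A'] avoids a KeyError for modules that never report that property.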