pandas 来自嵌套字典的数据框（elasticsearch 结果）

Question

我很难将 elasticsearch 聚合的结果转换为 pandas。我正在尝试编写一个抽象函数，它将采用嵌套字典（任意数量的级别）并将它们展平为 pandas 数据框

这是典型的结果

-- 编辑：我也添加了父键

x1 = {u'xColor': {u'buckets': [{u'doc_count': 4,
u'key': u'red',
u'xMake': {u'buckets': [{u'doc_count': 3,
   u'key': u'honda',
   u'xCity': {u'buckets': [{u'doc_count': 2, u'key': u'ROME'},
     {u'doc_count': 1, u'key': u'Paris'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}},
  {u'doc_count': 1,
   u'key': u'bmw',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
 u'sum_other_doc_count': 0}},
 {u'doc_count': 2,
u'key': u'blue',
u'xMake': {u'buckets': [{u'doc_count': 1,
   u'key': u'ford',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}},
  {u'doc_count': 1,
   u'key': u'toyota',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
   u'sum_other_doc_count': 0}},
    {u'doc_count': 2,
u'key': u'green',
u'xMake': {u'buckets': [{u'doc_count': 1,
   u'key': u'ford',
   u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}},
    {u'doc_count': 1,
      u'key': u'toyota',
     u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
    u'doc_count_error_upper_bound': 0,
    u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
 u'sum_other_doc_count': 0}}],
 u'doc_count_error_upper_bound': 0,
 u'sum_other_doc_count': 0}}

我想要的是具有最低级别 doc_count 的数据框

第一条记录

 red-honda-rome-2 

 red-honda-paris-1

 red-bmw-paris-1

我在 pandas here 中遇到了 json_normalize，但不明白如何放置参数，我看到了关于展平嵌套字典的不同建议，但真的不能了解它们是如何工作的。任何帮助我开始的帮助将不胜感激 Elasticsearch result to table

更新

我尝试使用 dpath 这是一个很棒的库，但我不知道如何抽象它（以仅将存储桶名称作为参数的函数形式），因为 dpath 无法处理结构其中值是列表（而不是其他字典）

import dpath 
import pandas as pd 

xListData = []
for q1 in dpath.util.get(x1, 'xColor/buckets'):
      xColor = q1['key']
for q2 in dpath.util.get(q1, 'xMake/buckets'):
    #print '--', q2['key']
    xMake = q2['key']
    for q3 in dpath.util.get(q2, 'xCity/buckets'):
        #xDict = []
        xCity = q3['key']
        doc_count = q3['doc_count']
        xDict = {'color': xColor, 'make': xMake, 'city': xCity, 'doc_count': doc_count}
        #print '------', q3['key'], q3['doc_count']
        xListData.append(xDict)

pd.DataFrame(xListData)

这给出：

city    color   doc_count   make
0   ROME    red     2   honda
1   Paris   red     1   honda
2   Paris   red     1   bmw
3   Paris   blue    1   ford
4   Berlin  blue    1   toyota
5   Berlin  green   1   ford
6   Berlin  green   1   toyota

Answer 1

尝试使用递归函数：

import pandas as pd
def elasticToDataframe(elasticResult,aggStructure,record={},fulllist=[]):
    for agg in aggStructure:
        buckets = elasticResult[agg['key']]['buckets']
        for bucket in buckets:
            record = record.copy()
            record[agg['key']] = bucket['key']
            if 'aggs' in agg: 
                elasticToDataframe(bucket,agg['aggs'],record,fulllist)
            else: 
                for var in agg['variables']:
                    record[var['dfName']] = bucket[var['elasticName']]

                fulllist.append(record)

    df = pd.DataFrame(fulllist)
    return df

然后使用您的数据 (x1) 和正确配置的 'aggStructure' 字典调用该函数。数据的嵌套性质必须在这个dict中体现出来。

aggStructure=[{'key':'xColor','aggs':[{'key':'xMake','aggs':[{'key':'xCity','variables':[{'elasticName':'doc_count','dfName':'count'}]}]}]}]
elasticToDataframe(x1,aggStructure)

干杯

Answer 2

有一个项目可以开箱即用：https://github.com/onesuper/pandasticsearch

也可以使用递归生成器和 MultiIndex 功能手动完成：

https://github.com/onesuper/pandasticsearch/blob/master/pandasticsearch/query.py#L125

pandas 来自嵌套字典的数据框（elasticsearch 结果）

pandas dataframe from a nested dictionary (elasticsearch result)

python

json

elasticsearch

pandas