Grouping and summing similar values in Python

I have data in this format:

d = [
 {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
 {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},

 {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
 {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},

]

That is, it has multiple entries with the same key.

I want to group it so that Clicks and Link Clicks are summed for each common date.

So the output should look like this:

d = [
 {'key': '2018-05-10', 'vals': {'Clicks': 368, 'Link Clicks': 221}},
 {'key': '2018-05-11', 'vals': {'Clicks': 1713, 'Link Clicks': 452}},
]

I thought of first using defaultdict to group the values together:

from collections import defaultdict

dd = defaultdict(list)

for i in d:
    dd[i['key']].append(i['vals'])

This gives the following output:

{'2018-05-10': [
             {'Clicks': 229, 'Link Clicks': 210},
             {'Clicks': 139, 'Link Clicks': 11}
              ],
 '2018-05-11': [
             {'Clicks': 365, 'Link Clicks': 379},
             {'Clicks': 1348, 'Link Clicks': 73}
             ]}

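Now I think I could use Counter to sum the values, but I'm not sure how to do it. The names of the keys, i.e. Clicks and Link Clicks, may change, and vals can have more than 2 entries.

Can this be done without defaultdict? Is there a better way?

Note: I don't think this defaultdict approach is good, because I always want the data ordered by date, and once I use a dict I lose the order.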

from pprint import pprint
from collections import Counter, OrderedDict

d = {
'2018-05-10': [
             {'Clicks': 229, 'Link Clicks': 210},
             {'Clicks': 139, 'Link Clicks': 11}
              ],
 '2018-05-11': [
             {'Clicks': 365, 'Link Clicks': 379},
             {'Clicks': 1348, 'Link Clicks': 73}
             ],
}

m = OrderedDict()
# sort by date so the result is ordered regardless of the input order
for k, v in sorted(d.items()):
    m[k] = Counter()
    for i in v:
        m[k].update(i)  # Counter.update adds the values
    m[k] = dict(m[k])
    # or if you want to keep the 'vals' key and list:
    # m[k] = [{"vals": dict(m[k])}]

pprint(m)

Output:

OrderedDict([('2018-05-10', {'Clicks': 368, 'Link Clicks': 221}),
             ('2018-05-11', {'Clicks': 1713, 'Link Clicks': 452})])
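
The loop above relies on Counter.update adding counts rather than replacing them, which is where it differs from dict.update. A small sketch of the difference (the variable names here are illustrative only, not from the original answer):

```python
from collections import Counter

first = {'Clicks': 229, 'Link Clicks': 210}
second = {'Clicks': 139, 'Link Clicks': 11}

# dict.update overwrites values for keys that already exist
plain = dict(first)
plain.update(second)

# Counter.update adds the incoming values to the existing ones
totals = Counter(first)
totals.update(second)

print(plain)         # {'Clicks': 139, 'Link Clicks': 11}
print(dict(totals))  # {'Clicks': 368, 'Link Clicks': 221}
```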

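You can use a nested dictionary comprehension. The relevant c_type keys, i.e. Clicks and Link Clicks, are taken from the first entry in each date's list. Otherwise, the approach naturally accepts any number of categories.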

res = {k: {'vals': {c_type: sum(item[c_type] for item in v) for c_type in v[0]}}
       for k, v in dd.items()}

{'2018-05-10': {'vals': {'Clicks': 368, 'Link Clicks': 221}},
 '2018-05-11': {'vals': {'Clicks': 1713, 'Link Clicks': 452}}}
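
Because the keys are read from v[0], a category that first appears in a later entry for the same date would be dropped, and a category missing from a later entry would raise KeyError. If that can happen in your data, one possible variation collects the union of keys and treats missing values as 0. This is only a sketch: the sample data with an extra 'Shares' key is made up for illustration.

```python
# Hypothetical grouped data where the second entry has an extra category
dd = {
    '2018-05-10': [
        {'Clicks': 229, 'Link Clicks': 210},
        {'Clicks': 139, 'Link Clicks': 11, 'Shares': 4},
    ],
}

res = {}
for k, v in dd.items():
    # union of every category seen for this date
    c_types = set().union(*v)
    # missing categories count as 0 for that entry
    res[k] = {'vals': {c: sum(item.get(c, 0) for item in v) for c in c_types}}

print(res['2018-05-10']['vals']['Clicks'])  # 368
print(res['2018-05-10']['vals']['Shares'])  # 4
```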

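I would suggest that, instead of a list of dictionaries where each dictionary has only the keys key and vals, your output should simply be an actual dictionary of key: vals pairs!

This makes the code cleaner and more readable, and makes accessing a specific date more concise: you don't need to iterate through a list (O(n)), you can look the date up directly and get the click counts.

So, for example: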

# collect the 'vals' dicts for each date
dates = {}
for dd in d:
    dates.setdefault(dd['key'], []).append(dd['vals'])

# sum each category across the dicts collected for that date
dates = {k: {kk: sum(dd[kk] for dd in v) for kk in v[0]}
         for k, v in dates.items()}

This gives:

{
  "2018-05-10": {
    "Clicks": 368,
    "Link Clicks": 221
  },
  "2018-05-11": {
    "Clicks": 1713,
    "Link Clicks": 452
  }
}

Now you can get the data for a specific date directly with:

dates['2018-05-11']['Clicks']
#1713

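If you need a sorted list of dictionaries (by date), we can take our current dictionary and order it by each date's position in the original data, since that appears to be sorted already: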

order = [dd['key'] for dd in d]
date_list = sorted([{'key':k,'vals':v} for k,v in dates.items()], \
                                       key=lambda dd: order.index(dd['key']))

This gives date_list as a list ordered by date:

[
  {
    "key": "2018-05-10",
    "vals": {
      "Clicks": 368,
      "Link Clicks": 221
    }
  },
  {
    "key": "2018-05-11",
    "vals": {
      "Clicks": 1713,
      "Link Clicks": 452
    }
  }
]
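
If the original list is not guaranteed to be in date order, an alternative sketch is to sort on the key itself. This assumes the keys are ISO-formatted YYYY-MM-DD strings, as in the question, so plain string comparison matches chronological order. It also avoids the repeated order.index lookups.

```python
# Summed totals keyed by date, deliberately out of order
dates = {
    '2018-05-11': {'Clicks': 1713, 'Link Clicks': 452},
    '2018-05-10': {'Clicks': 368, 'Link Clicks': 221},
}

# ISO dates (YYYY-MM-DD) sort chronologically as plain strings
date_list = [{'key': k, 'vals': dates[k]} for k in sorted(dates)]

print([entry['key'] for entry in date_list])  # ['2018-05-10', '2018-05-11']
```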

We can generalize this into a basic "group-fold" approach:

from operator import add, itemgetter

def group_fold(data, fold=add, key=itemgetter('key'), vals=itemgetter('vals')):
    result = {}
    for entry in data:
        ky = key(entry)
        vlb = vals(entry)
        vla = result.get(ky)
        if vla is not None:
            for subk, subv in vlb.items():
                if subk in vla:
                    vla[subk] = fold(vla[subk], subv)
                else:
                    vla[subk] = subv
        else:
            result[ky] = dict(vlb)
    return result

So we can now use it as group_fold(d), but we can also customize the fold function, for example to multiply instead of add:

from operator import mul

group_fold(d, fold=mul)
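
To see what different fold functions produce, here is a self-contained sketch that applies a condensed variant of group_fold to the question's data with add, mul and the built-in max. The condensed body is written for illustration and is not the code above verbatim.

```python
from operator import add, itemgetter, mul

def group_fold(data, fold=add, key=itemgetter('key'), vals=itemgetter('vals')):
    result = {}
    for entry in data:
        # one accumulator dict per group key
        acc = result.setdefault(key(entry), {})
        for subk, subv in vals(entry).items():
            # fold into the running value, or start with the first value seen
            acc[subk] = fold(acc[subk], subv) if subk in acc else subv
    return result

d = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

sums = group_fold(d)
products = group_fold(d, fold=mul)
maxima = group_fold(d, fold=max)

print(sums['2018-05-10'])      # {'Clicks': 368, 'Link Clicks': 221}
print(products['2018-05-10'])  # {'Clicks': 31831, 'Link Clicks': 2310}
print(maxima['2018-05-11'])    # {'Clicks': 1348, 'Link Clicks': 379}
```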

Using nested defaultdicts:

from collections import defaultdict

result = defaultdict(lambda: defaultdict(int))
for entry in d:
  for key, val in entry['vals'].items():
    result[entry['key']][key] += val

This gives you a result like this:

{"2018-05-10": {"Clicks": 368, "Link Clicks": 221}, "2018-05-11": {"Clicks": 1713, "Link Clicks": 452}}
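
Note that the result is still a defaultdict of defaultdicts, so looking up a missing date silently creates an empty entry. If that matters, a possible follow-up is to convert to plain dicts, sorted by date, once the totals are built. A sketch using the question's data (the name plain is just for illustration):

```python
from collections import defaultdict

d = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

result = defaultdict(lambda: defaultdict(int))
for entry in d:
    for key, val in entry['vals'].items():
        result[entry['key']][key] += val

# Freeze into ordinary dicts; sorted() puts the ISO date keys in order
plain = {date: dict(totals) for date, totals in sorted(result.items())}

print(plain)
# {'2018-05-10': {'Clicks': 368, 'Link Clicks': 221}, '2018-05-11': {'Clicks': 1713, 'Link Clicks': 452}}
```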

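Using itertools.groupby:

d = [
 {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
 {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
 {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
 {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

from collections import Counter
from itertools import groupby
from operator import itemgetter

newdict = {}
# groupby only merges adjacent items, so sort by date first
for dt, group in groupby(sorted(d, key=itemgetter('key')), key=itemgetter('key')):
    totals = Counter()
    for entry in group:
        totals.update(entry['vals'])  # adds the values per category
    newdict[dt] = dict(totals)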

Output:

{'2018-05-10': {'Clicks': 368, 'Link Clicks': 221},
 '2018-05-11': {'Clicks': 1713, 'Link Clicks': 452}}

from collections import defaultdict, Counter, OrderedDict

ld = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]
# pass the Counter class itself as the factory, not an instance
out = defaultdict(Counter)
for d in ld:
    out[d['key']].update(d['vals'])

new = OrderedDict(sorted(out.items()))
print(new)
# OrderedDict([('2018-05-10', Counter({'Clicks': 368, 'Link Clicks': 221})), ('2018-05-11', Counter({'Clicks': 1713, 'Link Clicks': 452}))])
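
If the list-of-dicts shape from the question is required, the Counter totals can be turned back into it in one more step. A sketch, assuming the same ld data as above (the name grouped is illustrative):

```python
from collections import Counter, defaultdict

ld = [
    {'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
    {'key': '2018-05-11', 'vals': {'Clicks': 365, 'Link Clicks': 379}},
    {'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
    {'key': '2018-05-11', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},
]

out = defaultdict(Counter)
for entry in ld:
    out[entry['key']].update(entry['vals'])

# Rebuild the question's format, ordered by date, with plain dicts as vals
grouped = [{'key': k, 'vals': dict(v)} for k, v in sorted(out.items())]

print(grouped)
```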

Try this solution:

d = [
{'key': '2018-05-10', 'vals': {'Clicks': 229, 'Link Clicks': 210}},
{'key': '2018-06-01', 'vals': {'Clicks': 365, 'Link Clicks': 379}},

{'key': '2018-05-10', 'vals': {'Clicks': 139, 'Link Clicks': 11}},
{'key': '2018-06-01', 'vals': {'Clicks': 1348, 'Link Clicks': 73}},

]

final_dict = {}

for doc in d:
    date = doc['key']

    if date not in final_dict:
        final_dict[date] = {}

        for key in doc['vals']:
            final_dict[date][key] = doc['vals'][key]

    else:

        for key in doc['vals']:
            # .get handles a category that first appears in a later entry
            final_dict[date][key] = final_dict[date].get(key, 0) + doc['vals'][key]


resp_dict = [{date: final_dict[date]} for date in sorted(final_dict)]

print(resp_dict)