List group by on max date

Question

Suppose I have a list like this:

[   
    ['group1', 'type1', '2021-3-24'],
    ['group1', 'type1', '2021-3-25'],
    ['group1', 'type1', '2021-3-26'],
    ['group2', 'type2', '2022-5-21'],
    ['group2', 'type2', '2021-1-12'],
    ['group2', 'type2', '2021-3-26'],
]

I want these results:

[   
    ['group1', 'type1', '2021-3-26'],
    ['group2', 'type2', '2022-5-21'],
]

Each list in the parent list is grouped by group and type, and the aggregate applied to each group is a "max date" operation.

The SQL equivalent of what I'm looking for:

select
    group,
    type,
    max(date)
from my_list
group by
    group,
    type

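I'd like to avoid the overhead of Pandas, since my dataset is relatively small and I think this can be done with itertools.groupby, but I can't find an example close enough to understand how it would work.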

Answer 1

You can use collections.defaultdict:

import collections, datetime

data = [['group1', 'type1', '2021-3-24'], ['group1', 'type1', '2021-3-25'], ['group1', 'type1', '2021-3-26'], ['group2', 'type2', '2022-5-21'], ['group2', 'type2', '2021-1-12'], ['group2', 'type2', '2021-3-26']]

# Collect the parsed dates for each (group, type) key.
d = collections.defaultdict(list)
for a, b, c in data:
  d[(a, b)].append(datetime.date(*map(int, c.split('-'))))

# str() on a date gives ISO format, so months and days come out zero-padded.
result = [[*a, str(max(b))] for a, b in d.items()]
print(result)

Output:

[['group1', 'type1', '2021-03-26'], ['group2', 'type2', '2022-05-21']]
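
Note that the output above is zero-padded ('2021-03-26'), while the question asked for the original strings ('2021-3-26'). If that matters, one possible variation is to keep the original date string per key and parse it only for comparison. This is a minimal sketch, not part of the original answer; `parse_date` and `max_date_rows` are hypothetical helper names introduced here for illustration.

```python
import datetime


def parse_date(text):
    # Hypothetical helper: turn 'YYYY-M-D' into a datetime.date for comparison.
    year, month, day = map(int, text.split('-'))
    return datetime.date(year, month, day)


def max_date_rows(rows):
    # Hypothetical helper: map each (group, type) key to the original
    # date string of its latest date.
    best = {}
    for group, kind, date_text in rows:
        key = (group, kind)
        if key not in best or parse_date(date_text) > parse_date(best[key]):
            best[key] = date_text
    # Dicts preserve insertion order, so keys appear in first-seen order.
    return [[group, kind, date_text] for (group, kind), date_text in best.items()]


data = [
    ['group1', 'type1', '2021-3-24'],
    ['group1', 'type1', '2021-3-25'],
    ['group1', 'type1', '2021-3-26'],
    ['group2', 'type2', '2022-5-21'],
    ['group2', 'type2', '2021-1-12'],
    ['group2', 'type2', '2021-3-26'],
]

result = max_date_rows(data)
print(result)
```

Because only one string is stored per key, this also avoids building a list of every date in each group.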

Answer 2

@Ajax's answer is good, but for completeness, here is a version with groupby:

lst = [
    ["group1", "type1", "2021-3-24"],
    ["group1", "type1", "2021-3-25"],
    ["group1", "type1", "2021-3-26"],
    ["group2", "type2", "2022-5-21"],
    ["group2", "type2", "2021-1-12"],
    ["group2", "type2", "2021-3-26"],
]

from itertools import groupby

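out = []
# groupby only merges consecutive rows that share a key, so sort first
# if the list is not already ordered by (group, type):
# lst = sorted(lst, key=lambda k: (k[0], k[1]))
for c, g in groupby(lst, lambda k: (k[0], k[1])):
    # Compare dates as [year, month, day] integer lists, then join the
    # largest one back into a string.
    out.append(
        [*c, "-".join(map(str, max([*map(int, v[-1].split("-"))] for v in g)))]
    )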

print(out)

Prints:

[['group1', 'type1', '2021-3-26'], ['group2', 'type2', '2022-5-21']]
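
For input that is not already ordered by group and type, the same idea could be written with an explicit sort and a `key` function passed to `max`, which returns the whole original row instead of rebuilding the date string. This is a sketch under that assumption, not the answerer's code; `parse_date` and `max_date_by_group` are hypothetical names used only for illustration.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def parse_date(text):
    # Hypothetical helper: 'YYYY-M-D' -> datetime.date, so dates compare by value.
    return date(*map(int, text.split("-")))


def max_date_by_group(rows):
    # Hypothetical helper: sort by (group, type) so equal keys are adjacent,
    # then pick the row with the latest date from each run.
    key = itemgetter(0, 1)
    out = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        latest = max(grp, key=lambda row: parse_date(row[2]))
        out.append(list(latest))  # copy so the input rows are not shared
    return out


# Rows deliberately interleaved to show why the sort is needed.
shuffled = [
    ["group2", "type2", "2021-1-12"],
    ["group1", "type1", "2021-3-25"],
    ["group2", "type2", "2022-5-21"],
    ["group1", "type1", "2021-3-26"],
    ["group1", "type1", "2021-3-24"],
    ["group2", "type2", "2021-3-26"],
]

out = max_date_by_group(shuffled)
print(out)
```

Sorting costs O(n log n), which should be negligible for a small dataset like the one in the question.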
