Python: Removing list duplicates based on first 2 inner list values

Question

Problem:

I have a list in the following format:

x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

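Algorithm:

Combine all inner lists that share the same first 2 values; the third value does not have to match for them to be combined
- e.g. "hello",0,5 is combined with "hello",0,8
- but not with "hello",1,1
The 3rd value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
- Note: by all 3rd vals I mean the 3rd value of each inner list in a group of duplicates
- e.g. "hello",0,5 and "hello",0,8 become "hello",0,6.5

Desired output: (the order of the list does not matter)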

x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]

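Question:

How can I implement this algorithm in Python?

Ideally it would be efficient, as this will be used on very large lists.

If anything is unclear, let me know and I will explain.

Edit: I tried converting the list to a set to remove duplicates, but that does not account for the third value in the inner lists, so it did not work.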
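To illustrate why the set approach falls short, here is a minimal sketch. The names `set_error`, `as_tuples` and `unique` are made up for this example and are not from the original attempt.

```python
x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]

# Lists are unhashable, so they cannot be put into a set directly.
try:
    set(x)
    set_error = None
except TypeError as exc:
    set_error = exc

# Converting to tuples works, but a set compares whole tuples, so entries
# that differ only in the third value are still treated as distinct.
as_tuples = [tuple(sub) for sub in x]
unique = set(as_tuples)

print(type(set_error).__name__)  # TypeError
print(len(unique))               # 4, nothing was merged
```

Merging therefore needs a key built from only the first two values, which is what the answers below do.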

Solution performance:

Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:
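For readers who want to run a comparable speed test themselves, the following is a rough sketch using `timeit`. It is not the harness used for the results above; `running_totals` and `make_data` are hypothetical names, and the data size and key distribution are arbitrary assumptions.

```python
import random
import timeit
from collections import defaultdict

def running_totals(lst):
    # Hypothetical candidate: keep [total, count] per (first, second) key.
    acc = defaultdict(lambda: [0, 0])
    for a, b, value in lst:
        entry = acc[(a, b)]
        entry[0] += value
        entry[1] += 1
    return [[a, b, total / count] for (a, b), (total, count) in acc.items()]

def make_data(n, seed=0):
    # Synthetic input with few distinct keys, so there is a lot to merge.
    rng = random.Random(seed)
    words = ["hello", "hi", "hey"]
    return [[rng.choice(words), rng.randint(0, 3), rng.randint(0, 9)]
            for _ in range(n)]

data = make_data(10_000)
elapsed = timeit.timeit(lambda: running_totals(data), number=10)
print(f"running_totals: {elapsed:.4f} s for 10 runs")
```

Swapping in each answer's function for `running_totals` and timing it on the same `data` gives a like-for-like comparison on your own machine.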

Answer 1

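Update using running sum and count

I figured out how to improve my previous code (see the original below). You can keep a running total and count, then compute the average at the end, which avoids storing all the individual numbers.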

from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def func(lst):
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].add(sub[2])
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]

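Original answer

This probably won't be very efficient, since it has to accumulate all the values in order to average them. I think you could get around that with a weighted running average, but I'm not quite sure how to do that.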

from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

def func(lst):
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].append(sub[2])
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
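As for the weighted running average mentioned in the original answer, one way it could look is the standard incremental-mean update, where each new value moves the mean by 1/count of the difference. This is a sketch of that idea, not the author's code; `func_incremental` is a made-up name.

```python
from collections import defaultdict

def func_incremental(lst):
    # Maps (first, second) -> [current mean, number of values seen]
    stats = defaultdict(lambda: [0.0, 0])
    for sub in lst:
        entry = stats[tuple(sub[:2])]
        entry[1] += 1
        # Shift the mean toward the new value, weighted by 1/count.
        entry[0] += (sub[2] - entry[0]) / entry[1]
    return [[*k, mean] for k, (mean, _count) in stats.items()]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(func_incremental(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
```

This does a division per element instead of one per key, so keeping a plain total and count is likely the simpler choice unless the totals themselves could grow too large.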

Answer 2

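You could try using groupby.

from itertools import groupby

m = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

# groupby only merges adjacent items, so sort by the same key first.
# A tuple key avoids the collisions that string concatenation can cause,
# e.g. ("a", 10) and ("a1", 0) would both become "a10".
key = lambda x: (x[0], x[1])
m.sort(key=key)

for (name, num), group in groupby(m, key):
    ss = 0
    c = 0
    for item in group:
        ss += item[2]
        c += 1
    print([name, num, ss / c])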

Answer 3

This should be O(N); someone correct me if I'm wrong:

def my_algorithm(input_list):
    """
    :param input_list: list of lists in format [string, int, int]
    :return: list
    """

    # Dict in format (string, int): [int, count_int]
    # So our list is in this format, example:
    # [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
    # so for our dict we will make keys a tuple of the first 2 values of each sublist (since that needs to be unique)
    # while values are a list of third element from our sublist + counter (which counts every time we have a duplicate
    # key, so we can divide it and get average).
    my_dict = {}
    for element in input_list:
        # key is a tuple of the first 2 values of each sublist
        key = (element[0], element[1])
        if key not in my_dict:
            # If the key does not exist, add it.
            # Value is in form of third element from our sublist + counter. Since this is first value set counter to 1
            my_dict[key] = [element[2], 1]
        else:
            # If key does exist then increment our value and increment counter by 1
            my_dict[key][0] += element[2]
            my_dict[key][1] += 1

    # we have a dict so we will need to convert it to list (and on the way calculate averages)
    return _convert_my_dict_to_list(my_dict)


def _convert_my_dict_to_list(my_dict):
    """
    :param my_dict: dict, key is in form of tuple (string, int) and values are in form of list [int, int_counter]
    :return: list
    """
    my_list = []
    for key, value in my_dict.items():
        sublist = [key[0], key[1], value[0]/value[1]]
        my_list.append(sublist)
    return my_list

my_algorithm(x)

This will return:

[['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]

whereas your expected return is:

[["hello", 0, 6.5], ["hi", 0, 6], ["hello", 1, 1]]

If you really need ints, you can modify the _convert_my_dict_to_list function.
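One possible way to make that change is sketched below: convert an average to int only when it is a whole number. `convert_with_ints` is a hypothetical replacement for `_convert_my_dict_to_list`, and it assumes Python 3, where `/` always produces a float.

```python
def convert_with_ints(my_dict):
    # my_dict maps (string, int) -> [total, count], as built by my_algorithm.
    my_list = []
    for key, (total, count) in my_dict.items():
        average = total / count
        # float.is_integer() is True for values such as 6.0 or 1.0
        if average.is_integer():
            average = int(average)
        my_list.append([key[0], key[1], average])
    return my_list

totals = {("hello", 0): [13, 2], ("hi", 0): [6, 1], ("hello", 1): [1, 1]}
print(convert_with_ints(totals))  # -> [['hello', 0, 6.5], ['hi', 0, 6], ['hello', 1, 1]]
```

The result then mixes ints and floats in the third position, which matches the expected output in the question but may be inconvenient for later numeric code.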

Answer 4

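Here is my variation on this theme: a groupby without the expensive sort. I also changed the problem so that the input and output are lists of tuples, since these are fixed-size records: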

from itertools import groupby
from operator import itemgetter
from collections import defaultdict

data = [("hello", 0, 5), ("hi", 0, 6), ("hello", 0, 8), ("hello", 1, 1)]

dictionary = defaultdict(complex)

for key, group in groupby(data, itemgetter(slice(2))):
    # Count every item: adjacent duplicates arrive together in one group
    values = [value for (string, number, value) in group]
    dictionary[key] += complex(sum(values), len(values))

array = [(*key, value.real / value.imag) for key, value in dictionary.items()]

print(array)

Output

> python3 test.py
[('hello', 0, 6.5), ('hi', 0, 6.0), ('hello', 1, 1.0)]
>

Thanks to @wjandrea for suggesting itemgetter in place of lambda. (And yes, I am using complex numbers to track the total and the count for the average.)
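To make the two ideas in this answer more concrete, here is a small sketch showing what groupby does on unsorted input and how a single complex number can hold both counters. The names `keys_seen` and `acc` are illustrative only.

```python
from itertools import groupby
from operator import itemgetter

data = [("hello", 0, 5), ("hi", 0, 6), ("hello", 0, 8), ("hello", 1, 1)]

# On unsorted input, groupby starts a new group whenever the key changes,
# so ("hello", 0) shows up twice and the dictionary has to merge the pieces.
keys_seen = [key for key, _group in groupby(data, itemgetter(slice(2)))]
print(keys_seen)
# -> [('hello', 0), ('hi', 0), ('hello', 0), ('hello', 1)]

# A complex number carries two counters at once: real = total, imag = count.
acc = complex(0, 0)
for value in (5, 8):
    acc += complex(value, 1)
print(acc.real / acc.imag)  # -> 6.5
```

Note that both parts of a complex number are floats, so very large integer totals would lose precision; a `[total, count]` pair avoids that.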


python

processing-efficiency

python-3.x
