Python: Removing list duplicates based on first 2 inner list values
Question:
I have a list in the following format:
x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
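The algorithm:
- Combine all inner lists that share the same first 2 values; the third value does not have to match for them to be combined
  - e.g. "hello",0,5 is combined with "hello",0,8
  - but not with "hello",1,1
- The 3rd value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
  - Note: by all 3rd vals I mean the 3rd value of each inner list of duplicates
  - e.g. "hello",0,5 and "hello",0,8 become "hello",0,6.5
Desired output: (the order of the list does not matter)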
x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]
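Question:
- How can I implement this algorithm in Python?
Ideally it would be efficient, as this will be used on very large lists.
If anything is unclear, let me know and I will explain.
Edit: I tried converting the list to a set to remove duplicates, but a set does not take the third value of the inner lists into account, so it did not work.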
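To illustrate why the set approach mentioned in the edit falls short, here is a minimal sketch using the sample list from the question. The names `direct_set_worked` and `as_set` are introduced only for illustration.

```python
x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]

# Lists are unhashable, so they cannot be placed in a set directly.
try:
    set(x)
    direct_set_worked = True
except TypeError:
    direct_set_worked = False

# Converting each inner list to a tuple makes it hashable, but a set compares
# whole tuples, so entries that differ only in the 3rd value are all kept.
as_set = {tuple(sub) for sub in x}

print(direct_set_worked)  # False
print(len(as_set))        # 4
```

Nothing is merged because ("hello", 0, 5) and ("hello", 0, 8) are different tuples; the grouping has to be done on the first 2 values only.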
Solution performance:
Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:
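A speed test like this could be set up along the following lines. This is only a sketch: the function `merge_by_first_two`, the data size, and the random test data are assumptions made for illustration, not the functions or data that were actually measured.

```python
import random
import timeit
from collections import defaultdict


def merge_by_first_two(lst):
    """Average the 3rd value of inner lists that share the first 2 values."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running total, count]
    for first, second, third in lst:
        entry = totals[(first, second)]
        entry[0] += third
        entry[1] += 1
    return [[*key, total / count] for key, (total, count) in totals.items()]


# Hypothetical test data: 10,000 rows drawn from a small pool of keys.
random.seed(0)
words = ["hello", "hi", "hey"]
big = [[random.choice(words), random.randint(0, 9), random.randint(0, 100)]
       for _ in range(10_000)]

runs = 10
elapsed = timeit.timeit(lambda: merge_by_first_two(big), number=runs)
print(f"merge_by_first_two: {elapsed / runs:.6f} s per call")
```

To compare several candidate functions, the same `timeit.timeit` call can be repeated for each one on the same input.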
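Update using running sums and counts
I figured out how to improve my previous code (see the original below). You can keep running totals and counts, then calculate the averages at the end, which avoids recording all the individual numbers.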
from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def func(lst):
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].add(sub[2])
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
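Original answer
This probably won't be very efficient, since it has to accumulate all the values in order to average them. I think you could get around that with a running average that takes weights into account, but I'm not quite sure how to do that.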
from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

def func(lst):
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].append(sub[2])
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
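The weighted running average that the original answer wonders about can be sketched with the standard incremental mean update, where each new value moves the mean by (value - mean) / count. This is one possible approach, not the answer author's code; the name `func_incremental` is hypothetical.

```python
from collections import defaultdict


def func_incremental(lst):
    # key -> [current mean, number of values seen so far]
    means = defaultdict(lambda: [0.0, 0])
    for sub in lst:
        entry = means[tuple(sub[:2])]
        entry[1] += 1
        # Move the mean toward the new value; the new value has weight 1/count.
        entry[0] += (sub[2] - entry[0]) / entry[1]
    return [[*k, mean] for k, (mean, _) in means.items()]


x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(func_incremental(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
```

Like keeping a running total and count, this stores only two numbers per key; it performs a division for every element instead of one per key.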
You can try using groupby.
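from itertools import groupby

m = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

# Group on the first 2 values as a tuple; concatenating them into one string
# could confuse distinct keys such as ("a1", 2) and ("a", 12).
key = lambda x: (x[0], x[1])

# groupby only merges adjacent items, so sort by the same key first.
m.sort(key=key)
for i, j in groupby(m, key):
    ss = 0
    c = 0.0
    for k in j:
        ss += k[2]
        c += 1.0
    # print() with parentheses works on both Python 2 and Python 3.
    print([k[0], k[1], ss/c])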
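The sort in the groupby answer matters because `itertools.groupby` only groups consecutive items with equal keys. A small sketch of that behaviour, using a made-up list `keys` purely for illustration:

```python
from itertools import groupby

keys = ["a", "b", "a", "a"]

# Unsorted: the two runs of "a" are reported as separate groups.
unsorted_groups = [(k, len(list(g))) for k, g in groupby(keys)]
print(unsorted_groups)  # [('a', 1), ('b', 1), ('a', 2)]

# Sorted first: each key appears exactly once.
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(keys))]
print(sorted_groups)  # [('a', 3), ('b', 1)]
```

Sorting costs O(N log N), which is why the dictionary-based answers that make a single pass over the list may be preferable for very large inputs.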
This should be O(N); someone correct me if I'm wrong:
def my_algorithm(input_list):
    """
    :param input_list: list of lists in format [string, int, int]
    :return: list
    """
    # Dict in format (string, int): [int, count_int]
    # So our list is in this format, example:
    # [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
    # so for our dict we will make keys a tuple of the first 2 values of each sublist (since that needs to be unique)
    # while values are a list of the third element from our sublist + a counter (which counts every time we have a
    # duplicate key, so we can divide by it and get the average).
    my_dict = {}
    for element in input_list:
        # key is a tuple of the first 2 values of each sublist
        key = (element[0], element[1])
        if key not in my_dict:
            # If the key does not exist, add it.
            # Value is in form of third element from our sublist + counter. Since this is the first value, set counter to 1
            my_dict[key] = [element[2], 1]
        else:
            # If the key does exist, add to our running value and increment the counter by 1
            my_dict[key][0] += element[2]
            my_dict[key][1] += 1
    # we have a dict so we will need to convert it to a list (and on the way calculate averages)
    return _convert_my_dict_to_list(my_dict)

def _convert_my_dict_to_list(my_dict):
    """
    :param my_dict: dict, key is in form of tuple (string, int) and values are in form of list [int, int_counter]
    :return: list
    """
    my_list = []
    for key, value in my_dict.items():
        sublist = [key[0], key[1], value[0]/value[1]]
        my_list.append(sublist)
    return my_list

my_algorithm(x)
This will return:
[['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
While your expected return is:
[["hello", 0, 6.5], ["hi", 0, 6], ["hello", 1, 1]]
If you really need ints, you can modify the _convert_my_dict_to_list function.
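One way such a modification might look is sketched below, assuming Python 3 (where `/` always yields a float) and integer inputs. The function name `convert_keeping_ints` and the sample `totals` dict are hypothetical, chosen to mirror the `(string, int): [total, count]` layout used above.

```python
def convert_keeping_ints(my_dict):
    """Like _convert_my_dict_to_list, but whole-number averages become ints."""
    my_list = []
    for key, (total, count) in my_dict.items():
        average = total / count
        # float.is_integer() is True for values such as 6.0 or 1.0.
        if average.is_integer():
            average = int(average)
        my_list.append([key[0], key[1], average])
    return my_list


totals = {("hello", 0): [13, 2], ("hi", 0): [6, 1], ("hello", 1): [1, 1]}
print(convert_keeping_ints(totals))  # -> [['hello', 0, 6.5], ['hi', 0, 6], ['hello', 1, 1]]
```

Note that this returns a mix of ints and floats, which matches the expected output in the question but may be less convenient for later numeric processing.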
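Here's my variation on this theme: a groupby without the expensive sort. I also changed the problem so that the input and output are a list of tuples, since these are fixed-size records: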
from itertools import groupby
from operator import itemgetter
from collections import defaultdict

data = [("hello", 0, 5), ("hi", 0, 6), ("hello", 0, 8), ("hello", 1, 1)]

dictionary = defaultdict(complex)

for key, group in groupby(data, itemgetter(slice(2))):
    # Real part accumulates the total, imaginary part the count. Count every
    # value in the group: adjacent duplicates arrive together in one group.
    values = [value for (string, number, value) in group]
    dictionary[key] += complex(sum(values), len(values))

array = [(*key, value.real / value.imag) for key, value in dictionary.items()]

print(array)
Output
> python3 test.py
[('hello', 0, 6.5), ('hi', 0, 6.0), ('hello', 1, 1.0)]
>
Thanks to @wjandrea for the itemgetter replacement for lambda. (And yes, I am using complex numbers akin to an average to track the total and count.)
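The complex-number trick can be seen in isolation in the short sketch below; the variable name `acc` is made up for illustration. The real part carries the running total and the imaginary part carries the count.

```python
acc = complex()          # defaultdict(complex) starts each key at 0j
for value in (5, 8):
    acc += value + 1j    # real part: running total, imaginary part: count

print(acc)                  # (13+2j)
print(acc.real / acc.imag)  # 6.5
```

Both parts of a complex number are stored as floats, so very large integer totals could lose precision with this approach; a two-element list or a small class avoids that.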