Given a large list of tuples, how can I group by the first element of each tuple and sum the last element of each tuple, without a Pandas DataFrame?

I have a large list of tuples, where each tuple contains 9 string elements:

pdf_results = [
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/19/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/19/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/20/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/20/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/21/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/21/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/23/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/24/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/24/22', 'SMI', '5', '0', '10', '5'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'RC', '8', '0', '16', '8'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'SMI', '5', '0', '10', '5'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/19/22', 'RC', '8', '0', '16', '8'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/19/22', 'SMI', '5', '0', '10', '5')
]

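Without using a Pandas DataFrame, what is the best way to group by the first element of each tuple so that the last element of each tuple can be summed? The output should look like this: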

desired_output = [
("Kohl's - Dallas", 70),
("Bronx-Lebanon Hospital Center", 26)
]

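I tried itertools.groupby, which seems like the most appropriate solution; however, I could not correctly iterate over, index, and sum the last element of each tuple without running into one of the following obstacles:

  1. The last element of each tuple is a string, and converting it to int prevents iteration, raising TypeError: 'int' object is not iterable
  2. ValueError: invalid literal for int() with base 10: 'b'

Attempt: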

from itertools import groupby

def getSiteName(siteChunk):
    return siteChunk[0]

siteNameGroup = groupby(pdf_results, getSiteName)

for key, group in siteNameGroup:
    print(key) # 1st element of tuple as desired
    for pdf_results in group:
        # Raises TypeError: unsupported operand type(s) for +: 'int' and 'str'
        print(sum(pdf_results[8]))
    print()

Why not use a simple for loop that fills an initially empty dictionary?

resultDict = {}
for value in pdf_results:
    if value[0] not in resultDict:
        resultDict[value[0]] = 0
    resultDict[value[0]] += float(value[-1])
print(resultDict)

Output:

{"Kohl's - Dallas": 70.0,
'Bronx-Lebanon Hospital Center': 26.0}

If a dictionary is not what you want and you insist on tuples, you can use:

list(resultDict.items())

Output:

[("Kohl's - Dallas", 70.0), ('Bronx-Lebanon Hospital Center', 26.0)]

Assuming your list is sorted by the first element, you can do this:

from itertools import groupby 

for k,v in groupby(pdf_results, key=lambda t: t[0]):
    print(k, sum(int(x[-1]) for x in v))

Prints:

Kohl's - Dallas 70
Bronx-Lebanon Hospital Center 26
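
The sorted assumption matters because groupby only merges consecutive items that share a key. A minimal sketch of the difference, using a small made-up `rows` list (hypothetical data, not the asker's) and sorting with `operator.itemgetter` before grouping:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical, deliberately unsorted sample rows (site name first, count last).
rows = [
    ("Site A", "RC", "8"),
    ("Site B", "RC", "8"),
    ("Site A", "SMI", "5"),
]

first = itemgetter(0)

# Without sorting, "Site A" shows up as two separate groups.
unsorted_totals = [
    (k, sum(int(r[-1]) for r in g))
    for k, g in groupby(rows, key=first)
]

# Sorting by the same key first makes equal keys adjacent.
sorted_totals = [
    (k, sum(int(r[-1]) for r in g))
    for k, g in groupby(sorted(rows, key=first), key=first)
]

print(unsorted_totals)  # [('Site A', 8), ('Site B', 8), ('Site A', 5)]
print(sorted_totals)    # [('Site A', 13), ('Site B', 8)]
```

Sorting costs O(n log n), so for data that is not already ordered a plain dict accumulation is usually the simpler choice.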

If the list is not sorted, just use a dict to sum the values, keyed by the first entry of each tuple:

res={}

for t in pdf_results:
    res[t[0]]=res.get(t[0],0)+int(t[-1])

>>> res
{"Kohl's - Dallas": 70, 'Bronx-Lebanon Hospital Center': 26}

You were almost there. Just change your

for pdf_results in group:
    print(sum(pdf_results[8]))

to:

print(sum(int(pdf_results[8])
          for pdf_results in group))

(Though I would also rename it to pdf_result, singular.)
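
For context, the original `sum(pdf_results[8])` failed because sum iterated over the characters of a single string such as '8' and tried to add them to 0. Put together, the corrected loop might look like the sketch below. It uses a shortened copy of the question's data and collects the totals into a list of tuples; the names `totals` and `pdf_result` are choices made here for illustration, not part of the original attempt.

```python
from itertools import groupby

# Shortened copy of the question's data, already ordered by site name.
pdf_results = [
    ("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'RC', '8', '0', '16', '8'),
    ("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'SMI', '5', '0', '10', '5'),
    ('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'RC', '8', '0', '16', '8'),
]

def getSiteName(siteChunk):
    return siteChunk[0]

totals = []
for key, group in groupby(pdf_results, getSiteName):
    # Convert each row's last field to int, then sum across the whole group.
    totals.append((key, sum(int(pdf_result[8]) for pdf_result in group)))

print(totals)  # [("Kohl's - Dallas", 13), ('Bronx-Lebanon Hospital Center', 8)]
```

Using a distinct loop variable also avoids rebinding pdf_results, which the original inner loop did.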

This also works:

from collections import defaultdict

output = defaultdict(int)

for item in pdf_results:
    output[item[0]] += int(item[-1])

print(list(output.items()))

Output:

[("Kohl's - Dallas", 70), ('Bronx-Lebanon Hospital Center', 26)]