How to get statistics from an array collection? (min, max, and average)

I have a text file containing a long 2D array, like this:

[[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585] ..... [6, 47]]

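The first element of each pair is a number between 1 and 7. I want to read all of the second elements and find the maximum and minimum for each group from 1 to 7 separately. For example, output like this: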

Min for first element with 1: 2
Max for first element with 1: 500
average: 251

Min for row with 2:  0
Max for row with 2:  5000
average: 2500

and so on 

What is the most efficient way to get the min, max, and average values, grouped by the first element of each pair?

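import ast

# Parse the file contents into a real list of [group, value] pairs
with open("myfile.txt", "r") as file:
    list_of_lists = ast.literal_eval(file.read())

# The first element is the group ID; the second is the value
unique_values = set(item[0] for item in list_of_lists)
group_list = [[item[1] for item in list_of_lists if item[0] == value] for value in unique_values]

print(group_list)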

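You can maintain a dictionary that maps each group ID (an integer) to a list of size 2 (one entry for the group's minimum, one for its maximum). To extract these values, iterate over the list once. Notably, this approach doesn't require any heavy dependencies such as numpy or pandas. It also doesn't require sorting, so it is asymptotically faster: O(n) for my approach versus O(n log n) for sorting: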

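# Sample data from the question as [group, value] pairs
items = [[1, 2], [5, 585], [2, 0], [1, 500], [2, 668], [3, 54],
         [4, 28], [3, 28], [4, 163], [3, 85], [5, 906], [2, 5000],
         [6, 358], [4, 69], [3, 89], [7, 258], [5, 632], [7, 585], [6, 47]]

# Maps each group ID to [current minimum, current maximum]
data = {}

for group, entry in items:
    if group not in data:
        # First value seen for this group is both its min and its max
        data[group] = [entry, entry]
    else:
        current_min, current_max = data[group]
        data[group] = [min(entry, current_min), max(entry, current_max)]

for key in data:
    print(f"Min for row with {key}: {data[key][0]}")
    print(f"Max for row with {key}: {data[key][1]}")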

This outputs:

Min for row with 1: 2
Max for row with 1: 500
Min for row with 5: 585
Max for row with 5: 906
Min for row with 2: 0
Max for row with 2: 5000
Min for row with 3: 28
Max for row with 3: 89
Min for row with 4: 28
Max for row with 4: 163
Min for row with 6: 47
Max for row with 6: 358
Min for row with 7: 258
Max for row with 7: 585
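
The question also asks for the average, which the min/max dictionary does not track. A minimal sketch of one way to extend the same single-pass idea, assuming the data is already a list of [group, value] pairs (the names `items` and `stats` and the shortened sample list are illustrative, not from the original answer):

```python
# Shortened sample taken from the question; names are illustrative.
items = [[1, 2], [5, 585], [2, 0], [1, 500], [2, 668], [3, 54]]

# Map each group ID to [min, max, running total, count].
stats = {}
for group, entry in items:
    if group not in stats:
        stats[group] = [entry, entry, entry, 1]
    else:
        s = stats[group]
        s[0] = min(s[0], entry)
        s[1] = max(s[1], entry)
        s[2] += entry
        s[3] += 1

for group in sorted(stats):
    lo, hi, total, count = stats[group]
    print(f"Group {group}: min={lo}, max={hi}, average={total / count}")
```

Keeping a running total and count avoids storing every value, so memory use stays proportional to the number of groups rather than the number of pairs.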

Using pandas:

import pandas as pd

data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668],
        [3, 54], [4, 28], [3, 28], [4,163], [3,85],
        [5,906], [2,5000], [6,358], [4,69], [3,89],
        [7, 258],[5, 632], [7, 585]]

grp = []
col = []
for k, v in data:
    grp.append(k)
    col.append(v)

pd.Series(col).groupby(grp).agg(["min", "mean", "max"])
#    min         mean   max
# 1    2   251.000000   500
# 2    0  1889.333333  5000
# 3   28    64.000000    89
# 4   28    86.666667   163
# 5  585   707.666667   906
# 6  358   358.000000   358
# 7  258   421.500000   585

We can use pandas for this:

import numpy as np
import pandas as pd

file_data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585], [6, 47]]

file_data = np.array(file_data)

df = pd.DataFrame(data = {'num': file_data[:, 0], 'data': file_data[:, 1]})

for i in np.sort(df['num'].unique()):
    print('Min for', i, ':', df.loc[df['num'] == i, 'data'].min())
    print('Max for', i, ':', df.loc[df['num'] == i, 'data'].max())
    temp_df = df.loc[df['num'] == i, 'data']
    print("Average for", i, ":", temp_df.sum()/len(temp_df.index))

This gives us:

Min for 1 : 2
Max for 1 : 500
Average for 1 : 251.0
Min for 2 : 0
Max for 2 : 5000
Average for 2 : 1889.3333333333333
Min for 3 : 28
Max for 3 : 89
Average for 3 : 64.0
Min for 4 : 28
Max for 4 : 163
Average for 4 : 86.66666666666667
Min for 5 : 585
Max for 5 : 906
Average for 5 : 707.6666666666666
Min for 6 : 47
Max for 6 : 358
Average for 6 : 202.5
Min for 7 : 258
Max for 7 : 585
Average for 7 : 421.5

You can sort the list of lists by the first element of each item, and then groupby the same element.

import itertools
l = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585]]

l.sort(key=lambda item: item[0])
groups = { k: [item[1] for item in v] 
           for k, v in itertools.groupby(l, key=lambda item: item[0])}

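Why do I need to sort first? Because itertools.groupby() only groups consecutive items that share a key, so items with the same key must be adjacent.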

This gives groups =

{1: [2, 500],
 2: [0, 668, 5000],
 3: [54, 28, 85, 89],
 4: [28, 163, 69],
 5: [585, 906, 632],
 6: [358],
 7: [258, 585]}
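
To see why the sort matters: itertools.groupby only merges consecutive items whose keys are equal, so on unsorted input the same key can come back as several separate groups. A small sketch with a shortened, illustrative list (`pairs` is a name chosen for this example):

```python
import itertools

pairs = [[1, 2], [5, 585], [1, 500]]  # illustrative, deliberately unsorted


def first(item):
    """Grouping key: the first element of each pair."""
    return item[0]


# Without sorting, key 1 shows up twice because its items are not adjacent.
unsorted_keys = [k for k, _ in itertools.groupby(pairs, key=first)]
print(unsorted_keys)  # [1, 5, 1]

# After sorting by the key, each key forms exactly one group.
sorted_keys = [k for k, _ in itertools.groupby(sorted(pairs, key=first), key=first)]
print(sorted_keys)  # [1, 5]
```

In a dict comprehension, the later group for key 1 would silently overwrite the earlier one, so the unsorted version would lose data rather than raise an error.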

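Explanation of the groups = ... line:

  • First, I groupby() the sorted list using the first element of each item as the key. This groups all elements with the same key into one iterable, so for the key 1 we get an iterable containing the elements [1, 2] and [1, 500].
  • I iterate over this groupby() result and build a dictionary with a dict comprehension:
    • For the dictionary's keys, I use the key from groupby().
    • For the dictionary's values, I use a list comprehension that iterates over each item in the group and takes only the second element of that item, item[1] (so the key 1 now has a value that is a list containing 2 and 500).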

Then, using only the second element of each item, find the max, min, and average for each group:

max_vals = {k: max(v) for k, v in groups.items()}
# {1: 500, 2: 5000, 3: 89, 4: 163, 5: 906, 6: 358, 7: 585}

min_vals = {k: min(v) for k, v in groups.items()}
# {1: 2, 2: 0, 3: 28, 4: 28, 5: 585, 6: 358, 7: 258}

avg_vals = {k: sum(v) / len(v) for k, v in groups.items()}
# {1: 251.0,  2: 1889.3333333333333,  3: 64.0,  4: 86.66666666666667,  5: 707.6666666666666,  6: 358.0,  7: 421.5}

Or, to print them the way you want:

for k, v in groups.items():
    print(f"Max for first element with {k}: {max(v)}")  
    print(f"Min for first element with {k}: {min(v)}")
    print(f"Average: {sum(v) / len(v)}")

Which gives:

Max for first element with 1: 500
Min for first element with 1: 2
Average: 251.0
Max for first element with 2: 5000
Min for first element with 2: 0
Average: 1889.3333333333333
Max for first element with 3: 89
Min for first element with 3: 28
Average: 64.0
... and so on
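
If the sort is only there to satisfy groupby(), a possible alternative is to collect the values into a collections.defaultdict(list) in one pass and compute the statistics per group afterwards. This is a different technique from the groupby() approach in this answer, sketched here with an illustrative subset of the data (`pairs`, `groups_by_id`, and `summary` are names chosen for this example):

```python
from collections import defaultdict
from statistics import mean

pairs = [[1, 2], [5, 585], [2, 0], [1, 500], [2, 668], [2, 5000]]  # illustrative subset

# Append each value to the list for its group ID; no sorting required.
groups_by_id = defaultdict(list)
for group, value in pairs:
    groups_by_id[group].append(value)

# Per-group statistics as (min, max, average) tuples.
summary = {k: (min(v), max(v), mean(v)) for k, v in groups_by_id.items()}

for k in sorted(summary):
    lo, hi, avg = summary[k]
    print(f"Group {k}: min={lo}, max={hi}, average={avg}")
```

This keeps every value in memory, like the groupby() version, but skips the O(n log n) sort.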