How to get statistics from an array collection? (min, max, and average)
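I have a text file containing a long 2D array, like this:
[[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585] ..... [6, 47]]
The first element of each pair is a number between 1 and 7. I want to read all the second elements and find the maximum and minimum separately for each group from 1 to 7. For example, output like this:
Min for first element with 1: 2
Max for first element with 1: 500
average: 251
Min for row with 2: 0
Max for row with 2: 5000
average: 2500
and so on
What is the most efficient way to get the min, max, and average values, grouped by the first element of each pair?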
file = open("myfile.txt", "r")
list_of_lists = file.read()
unique_values = set([list[1] for list in list_of_lists])
group_list = [[list[0] for list in list_of_lists if list[1] == value] for value in unique_values]
print(group_list)
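You can maintain a dictionary that maps each group ID (an integer) to a list of size 2 (one entry for the group's minimum, one for its maximum). To extract these values, iterate over the list once. Notably, this approach does not need any heavy dependencies such as numpy or pandas. It also does not require sorting, so it runs asymptotically faster: O(n) for my approach versus O(n log n) for sorting:
# items is assumed to be the already-parsed list of [group, value] pairs
data = {}
for group, entry in items:
    if group not in data:
        data[group] = [entry, entry]
    else:
        current_min, current_max = data[group]
        data[group] = [min(entry, current_min), max(entry, current_max)]
for key in data:
    print(f"Min for row with {key}: {data[key][0]}")
    print(f"Max for row with {key}: {data[key][1]}")
This outputs: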
Min for row with 1: 2
Max for row with 1: 500
Min for row with 5: 585
Max for row with 5: 906
Min for row with 2: 0
Max for row with 2: 5000
Min for row with 3: 28
Max for row with 3: 89
Min for row with 4: 28
Max for row with 4: 163
Min for row with 6: 47
Max for row with 6: 358
Min for row with 7: 258
Max for row with 7: 585
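The loop above leaves out two things the question asks about: turning the file contents into a list, and the average. A minimal sketch of both follows, assuming the file holds a single Python-style list literal. The string text stands in for the file contents, and the names stats, lo, hi, total and count are illustrative, not from the original answer.

```python
import ast

# Stand-in for the contents of myfile.txt (assumed to be one list literal).
text = "[[1, 2], [5, 585], [2, 0], [1, 500], [2, 668], [6, 358], [6, 47]]"

# file.read() returns a string, so parse it into a real list of pairs first.
items = ast.literal_eval(text)

# group -> [min, max, running sum, count], filled in a single pass
stats = {}
for group, entry in items:
    if group not in stats:
        stats[group] = [entry, entry, entry, 1]
    else:
        s = stats[group]
        s[0] = min(s[0], entry)
        s[1] = max(s[1], entry)
        s[2] += entry
        s[3] += 1

for group in sorted(stats):
    lo, hi, total, count = stats[group]
    print(f"Min for row with {group}: {lo}")
    print(f"Max for row with {group}: {hi}")
    print(f"Average for row with {group}: {total / count}")
```

Keeping a running sum and count preserves the single O(n) pass, and the average is computed only when printing.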
Using pandas:
import pandas as pd

data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668],
        [3, 54], [4, 28], [3, 28], [4,163], [3,85],
        [5,906], [2,5000], [6,358], [4,69], [3,89],
        [7, 258],[5, 632], [7, 585]]
# Split the pairs into group labels and values
grp = []
col = []
for k, v in data:
    grp.append(k)
    col.append(v)
print(pd.Series(col).groupby(grp).agg(["min", "mean", "max"]))
#    min         mean   max
# 1    2   251.000000   500
# 2    0  1889.333333  5000
# 3   28    64.000000    89
# 4   28    86.666667   163
# 5  585   707.666667   906
# 6  358   358.000000   358
# 7  258   421.500000   585
We can use pandas for this:
import numpy as np
import pandas as pd
file_data = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585], [6, 47]]
file_data = np.array(file_data)
df = pd.DataFrame(data={'num': file_data[:, 0], 'data': file_data[:, 1]})
for i in np.sort(df['num'].unique()):
    print('Min for', i, ':', df.loc[df['num'] == i, 'data'].min())
    print('Max for', i, ':', df.loc[df['num'] == i, 'data'].max())
    temp_df = df.loc[df['num'] == i, 'data']
    print("Average for", i, ":", temp_df.sum()/len(temp_df.index))
This gives us:
Min for 1 : 2
Max for 1 : 500
Average for 1 : 251.0
Min for 2 : 0
Max for 2 : 5000
Average for 2 : 1889.3333333333333
Min for 3 : 28
Max for 3 : 89
Average for 3 : 64.0
Min for 4 : 28
Max for 4 : 163
Average for 4 : 86.66666666666667
Min for 5 : 585
Max for 5 : 906
Average for 5 : 707.6666666666666
Min for 6 : 47
Max for 6 : 358
Average for 6 : 202.5
Min for 7 : 258
Max for 7 : 585
Average for 7 : 421.5
You can sort the list of lists by the first element of each item, and then groupby that same element.
import itertools
l = [[1, 2], [5,585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28], [3, 28], [4,163], [3,85], [5,906], [2,5000], [6,358], [4,69], [3,89], [7, 258],[5, 632], [7, 585]]
l.sort(key=lambda item: item[0])
groups = {k: [item[1] for item in v]
          for k, v in itertools.groupby(l, key=lambda item: item[0])}
Why sort first? itertools.groupby() only groups consecutive items that share a key, so items with the same key must be adjacent before grouping.
This gives groups =
{1: [2, 500],
2: [0, 668, 5000],
3: [54, 28, 85, 89],
4: [28, 163, 69],
5: [585, 906, 632],
6: [358],
7: [258, 585]}
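Explanation of the groups = ... line:
- First, I groupby() the sorted list, using the first element of each item as the key. This collects all elements with the same key into one iterable, so for the key 1 we get an iterable containing the elements [1, 2] and [1, 500].
- I iterate over this groupby() result and build a dictionary with a dict comprehension.
  - For the dictionary keys, I use the groupby() key.
  - For the dict values, I have a list comprehension that iterates over every item in the group and takes only the second element of that item (so the key 1 now has a value that is a list containing 2 and 500).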
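To illustrate why the sort matters, here is a small sketch comparing groupby() on unsorted and sorted input. The shortened list pairs and the names first, unsorted_keys and sorted_groups are made up for this illustration.

```python
import itertools
from operator import itemgetter

pairs = [[1, 2], [5, 585], [2, 0], [1, 500]]
first = itemgetter(0)  # key function: the first element of each pair

# Without sorting, the two items with key 1 are not adjacent,
# so groupby() starts a new group each time the key changes.
unsorted_keys = [k for k, _ in itertools.groupby(pairs, key=first)]
print(unsorted_keys)  # [1, 5, 2, 1]

# After sorting by the same key, each key forms exactly one group.
sorted_groups = {
    k: [item[1] for item in g]
    for k, g in itertools.groupby(sorted(pairs, key=first), key=first)
}
print(sorted_groups)  # {1: [2, 500], 2: [0], 5: [585]}
```

On unsorted input, a dict comprehension would silently keep only the last group seen for a repeated key, which is why the sort comes first.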
Then, using only the second element of each item, find the max, min, and average of each group:
max_vals = {k: max(v) for k, v in groups.items()}
# {1: 500, 2: 5000, 3: 89, 4: 163, 5: 906, 6: 358, 7: 585}
min_vals = {k: min(v) for k, v in groups.items()}
# {1: 2, 2: 0, 3: 28, 4: 28, 5: 585, 6: 358, 7: 258}
avg_vals = {k: sum(v) / len(v) for k, v in groups.items()}
# {1: 251.0, 2: 1889.3333333333333, 3: 64.0, 4: 86.66666666666667, 5: 707.6666666666666, 6: 358.0, 7: 421.5}
Or, to print them the way you want:
for k, v in groups.items():
    print(f"Max for first element with {k}: {max(v)}")
    print(f"Min for first element with {k}: {min(v)}")
    print(f"Average: {sum(v) / len(v)}")
This gives:
Max for first element with 1: 500
Min for first element with 1: 2
Average: 251.0
Max for first element with 2: 5000
Min for first element with 2: 0
Average: 1889.3333333333333
Max for first element with 3: 89
Min for first element with 3: 28
Average: 64.0
... and so on
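If the sort is not wanted, the same groups dictionary could probably also be built in one pass with collections.defaultdict. This is a sketch of an alternative technique, not the method used in the answer above. The names pairs and summary are illustrative, and statistics.mean is swapped in for sum(v) / len(v).

```python
from collections import defaultdict
from statistics import mean

pairs = [[1, 2], [5, 585], [2, 0], [1, 500], [2, 668], [3, 54], [4, 28],
         [3, 28], [4, 163], [3, 85], [5, 906], [2, 5000], [6, 358],
         [4, 69], [3, 89], [7, 258], [5, 632], [7, 585]]

# Append each value to the list for its key; no sorting is needed.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# key -> (min, max, mean), with keys in ascending order
summary = {k: (min(v), max(v), mean(v)) for k, v in sorted(groups.items())}

for k, (lo, hi, avg) in summary.items():
    print(f"Max for first element with {k}: {hi}")
    print(f"Min for first element with {k}: {lo}")
    print(f"Average: {avg}")
```

Only the handful of distinct keys gets sorted here, not the full list of pairs.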