如何计算特定行的列的最小值和最大值?

How to calculate min and max of a column for particular rows?

我有一个 csv 文件如下:

    0  2  1  1    464  385  171    0:44:4   
    1  1  2  26    254  444  525    0:56:2   
    2  3  1  90    525  785  522    0:52:8   
    3  8  2  3    525  233  555    0:52:8  
    4  7  1  10    525  433  522    1:52:8   
    5  9  2  55     525  555  522    1:52:8   
    6  6  3  3   392  111  232    1:43:4   
    7  1  4  23    322  191  112    1:43:4   
    8  1  3  30    322  191  112    1:43:4   
    9  1  5   2   322  191  112    1:43:4   
    10  1  3  22   322  191  112    1:43:4   
    11  1  4  44   322  191  112    1:43:4   
    12  1  5  1   322  191  112    1:43:4   
    12  1  4  3    322  191  112    1:43:4   
    12  1  6  33    322  191  112    1:43:4   
    12  1  6  1    322  191  112    1:43:4 
    12  1  5  3    322  191  112    1:43:4   
    12  1  6  33    322  191  112    1:43:4   
     

    .
    .

第三列的数字在 1 到 6 之间。我想读取第三列中数字为 1 到 6 的所有行的第 4 列和第 5 列的信息,并找出每行的最大和最小数量分别有 1 到 6 号。例如这样的输出:

Mix for row with 1: 1  
Max for row with 1: 90

Min for row with 2:  3
Max for row with 2:  55

and so on 

我可以使用以下代码绘制图形。如何按组获取汇总统计信息?我正在寻找的是在一次调用中获取同一组的多个统计信息,如平均值、最小值、最大值、每个组的数量,这可行吗?

import matplotlib.pyplot as plt
import csv
x= []
y= []

with open('mydata.csv','r') as csvfile:
    ap = csv.reader(csvfile, delimiter=',')
    for row in ap:
        x.append(int(row[2]))
        y.append(int(row[7]))

plt.scatter(x, y, color = 'g',s = 4, marker='o')
plt.show()

一种简单的方法是使用 Pandas with read_csv(), .groupby() and .agg():

import pandas as pd

df = pd.read_csv("mydata.csv", header=None)

def min_max_avg(col):
    return (col.min() + col.max()) / 2
    
result = df[[2, 3, 4]].groupby(2).agg(["min", "max", "mean", min_max_avg])

结果:

    3                               4                             
  min max       mean min_max_avg  min  max        mean min_max_avg
2                                                                 
1   1  90  33.666667        45.5  464  525  504.666667       494.5
2   3  55  28.000000        29.0  254  525  434.666667       389.5
3   3  30  18.333333        16.5  322  392  345.333333       357.0
4   3  44  23.333333        23.5  322  322  322.000000       322.0
5   1   3   2.000000         2.0  322  322  322.000000       322.0
6   1  33  22.333333        17.0  322  322  322.000000       322.0

如果您不喜欢这样,您可以使用纯 Python 来完成,只是需要多做一些工作:

import csv

data = {}
with open("mydata.csv", "r") as file:
    for row in csv.reader(file):
        dct = data.setdefault(row[2], {})
        for col in (3, 4):
            dct.setdefault(col, []).append(row[col])

min_str = "Min for group {} - column {}: {}"
max_str = "Max for group {} - column {}: {}"
for row in data:
    for col in (3, 4):
        print(min_str.format(row, col, min(data[row][col])))
        print(max_str.format(row, col, max(data[row][col])))

结果:

Min for group 1 - column 3: 1
Max for group 1 - column 3: 90
Min for group 1 - column 4: 464
Max for group 1 - column 4: 525
Min for group 2 - column 3: 26
Max for group 2 - column 3: 55
Min for group 2 - column 4: 254
Max for group 2 - column 4: 525
Min for group 3 - column 3: 22
Max for group 3 - column 3: 30
Min for group 3 - column 4: 322
Max for group 3 - column 4: 392
...

mydata.csv:

0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
6,6,3,3,392,111,232,1:43:4
7,1,4,23,322,191,112,1:43:4
8,1,3,30,322,191,112,1:43:4
9,1,5,2,322,191,112,1:43:4
10,1,3,22,322,191,112,1:43:4
11,1,4,44,322,191,112,1:43:4
12,1,5,1,322,191,112,1:43:4
12,1,4,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4
12,1,6,1,322,191,112,1:43:4
12,1,5,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4