Using pandas.cut() and setting it as the index of a dataframe

Question

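I'm trying to find a simpler way to run aggregate functions on my dataframe, rather than manually extracting the data and running the functions separately from the dataframe itself. I have football stats for a team, and I want to run analysis and statistics based on age. I want to bin the ages and then compute statistics for each age group. More specifically, I have a df: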

import numpy as np
import pandas as pd

# Random example stats for 22 players
df = pd.DataFrame({'Age':[20,30,22,27,35,33,22,28,29,21,28,33,29,27,31,20,25,26,31,33,29,18],
             'Goals':np.random.randint(1,6,22),
             'Shots on Goals':np.random.randint(5,20,22),
             'Yellow Cards':np.random.randint(1,6,22),
             'Assists':np.random.randint(0,16,22)})
# Bin ages into right-closed intervals: (17, 24], (24, 28], (28, 32], (32, 36]
df['Age Grps'] = pd.cut(df.Age, bins=[17,24,28,32,36])
df.set_index(['Age Grps'], inplace=True)
df.head(8)

This outputs the following dataframe, with the index set to the binned age groups:

| Age Grps | Age | Assists | Goals | Shots on Goals | Yellow Cards |
|----------|-----|---------|-------|----------------|--------------|
| (17, 24] |  20 |    3    |   3   |       13       |       2      |
| (28, 32] |  30 |    2    |   3   |       11       |       3      |
| (17, 24] |  22 |    10   |   3   |       14       |       5      |
| (24, 28] |  27 |    3    |   1   |       16       |       3      |
| (32, 36] |  35 |    1    |   4   |       5        |       1      |
| (32, 36] |  33 |    5    |   4   |       17       |       1      |
| (17, 24] |  22 |    14   |   5   |       13       |       3      |
| (24, 28] |  28 |    14   |   2   |       7        |       4      |

Is it possible to group by the current index (Age Grps) to produce the following result:

╔══════════╦═════╦═════════╦═══════╦════════════════╦══════════════╗
║ Age Grps ║ Age ║ Assists ║ Goals ║ Shots on Goals ║ Yellow Cards ║
╠══════════╬═════╬═════════╬═══════╬════════════════╬══════════════╣
║ (17, 24] ║  20 ║    3    ║   3   ║       13       ║       2      ║
║          ╠═════╬═════════╬═══════╬════════════════╬══════════════╣
║          ║  22 ║    14   ║   5   ║       13       ║       3      ║
║          ╠═════╬═════════╬═══════╬════════════════╬══════════════╣
║          ║  22 ║    10   ║   3   ║       14       ║       5      ║
╠══════════╬═════╬═════════╬═══════╬════════════════╬══════════════╣
║ (24, 28] ║  27 ║    3    ║   1   ║       16       ║       3      ║
║          ╠═════╬═════════╬═══════╬════════════════╬══════════════╣
║          ║  28 ║    14   ║   2   ║       7        ║       4      ║
╠══════════╬═════╬═════════╬═══════╬════════════════╬══════════════╣
║ (28, 32] ║  30 ║    2    ║   3   ║       11       ║       3      ║
╠══════════╬═════╬═════════╬═══════╬════════════════╬══════════════╣
║ (32, 36] ║  35 ║    1    ║   4   ║       5        ║       1      ║
║          ╠═════╬═════════╬═══════╬════════════════╬══════════════╣
║          ║  33 ║    5    ║   4   ║       17       ║       1      ║
╚══════════╩═════╩═════════╩═══════╩════════════════╩══════════════╝

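What I want to do is compute summary statistics for each age group, e.g. average assists, average goals, average shots on goal per age group, and so on. Something like: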

df['Average Goals'] = df.groupby('bucket')['Goals'].mean()
df['Average Assists'] = df.groupby('bucket')['Assists'].mean()

in order to produce a table like this:

╔══════════╦═════╦═════════╦═════════════════╦═══════╦═══════════════╦════════════════╦══════════════╗
║ Index    ║ Age ║ Assists ║ Average Assists ║ Goals ║ Average Goals ║ Shots on Goals ║ Yellow Cards ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬════════════════╬══════════════╣
║ (17, 24] ║  20 ║    3    ║        9        ║   3   ║      3.67     ║       13       ║       2      ║
║          ╠═════╬═════════╣                 ╠═══════╣               ╠════════════════╬══════════════╣
║          ║  22 ║    14   ║                 ║   5   ║               ║       13       ║       3      ║
║          ╠═════╬═════════╣                 ╠═══════╣               ╠════════════════╬══════════════╣
║          ║  22 ║    10   ║                 ║   3   ║               ║       14       ║       5      ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬════════════════╬══════════════╣
║ (24, 28] ║  27 ║    3    ║       8.5       ║   1   ║      1.5      ║       16       ║       3      ║
║          ╠═════╬═════════╣                 ╠═══════╣               ╠════════════════╬══════════════╣
║          ║  28 ║    14   ║                 ║   2   ║               ║       7        ║       4      ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬════════════════╬══════════════╣
║ (28, 32] ║  30 ║    2    ║        2        ║   3   ║       3       ║       11       ║       3      ║
╠══════════╬═════╬═════════╬═════════════════╬═══════╬═══════════════╬════════════════╬══════════════╣
║ (32, 36] ║  35 ║    1    ║        3        ║   4   ║       4       ║       5        ║       1      ║
║          ╠═════╬═════════╣                 ╠═══════╣               ╠════════════════╬══════════════╣
║          ║  33 ║    5    ║                 ║   4   ║               ║       17       ║       1      ║
╚══════════╩═════╩═════════╩═════════════════╩═══════╩═══════════════╩════════════════╩══════════════╝

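I know I could extract the data as lists and compute the statistics I need, but I'm trying to do things the "pandorable" way. I will also be plotting this data with matplotlib, and I'd like a simple way to do that through the pandas and matplotlib API, df.plot().

Thanks in advance for your help.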

Answer 1

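I think you need transform if you want the new columns added to the original df. However, if the index has been set from the Age Grps column, it returns a lot of warnings, so keep Age Grps as a regular column: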

df['Age Grps'] = pd.cut(df.Age, bins =[17,24,28,32,36])
df = df.sort_values('Age Grps')
df['Average Goals'] = df.groupby('Age Grps')['Goals'].transform('mean')
df['Average Assists'] = df.groupby('Age Grps')['Assists'].transform('mean')
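To make the effect of transform easier to check, here is a minimal self-contained sketch. It assumes the eight rows shown by df.head(8) in the question instead of the random data, and it passes observed=True (my addition, not part of the code above) so that only bins containing rows are considered:

```python
import pandas as pd

# The eight rows shown by df.head(8) in the question (fixed, not random).
df = pd.DataFrame({
    'Age':     [20, 30, 22, 27, 35, 33, 22, 28],
    'Goals':   [3, 3, 3, 1, 4, 4, 5, 2],
    'Assists': [3, 2, 10, 3, 1, 5, 14, 14],
})

# Keep the bins as an ordinary column rather than the index.
df['Age Grps'] = pd.cut(df['Age'], bins=[17, 24, 28, 32, 36])

# observed=True restricts the grouping to bins that actually contain rows.
grouped = df.groupby('Age Grps', observed=True)

# transform('mean') returns one value per original row, aligned with df,
# so each player gets the mean of his own age group.
df['Average Goals'] = grouped['Goals'].transform('mean')
df['Average Assists'] = grouped['Assists'].transform('mean')

print(df.sort_values('Age Grps'))
```

Every row in the same age group ends up with the same average, which is the repeated-value layout the question's last table asks for.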

But if you need aggregated data, use DataFrameGroupBy.agg:

# Parentheses let the method chain continue on the next line
df1 = (df.groupby(pd.cut(df.Age, bins=[17,24,28,32,36]))
         .agg({'Goals':'mean', 'Assists':'mean', 'Yellow Cards':'sum'}))
print(df1)
          Yellow Cards    Assists     Goals
Age                                        
(17, 24]            12   8.000000  3.166667
(24, 28]            18   4.833333  1.833333
(28, 32]            21  11.333333  3.000000
(32, 36]            13  11.000000  2.250000
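If the bins are already the index, as in the question, grouping on the index level by name should also work for a plain aggregation in recent pandas versions. The following is only a sketch under that assumption, again using the eight rows from df.head(8) so the numbers can be verified by hand:

```python
import pandas as pd

# The eight rows shown by df.head(8) in the question (fixed, not random).
df = pd.DataFrame({
    'Age':          [20, 30, 22, 27, 35, 33, 22, 28],
    'Goals':        [3, 3, 3, 1, 4, 4, 5, 2],
    'Assists':      [3, 2, 10, 3, 1, 5, 14, 14],
    'Yellow Cards': [2, 3, 5, 3, 1, 1, 3, 4],
})

# Bin the ages and use the bins as the index, as the question does.
df['Age Grps'] = pd.cut(df['Age'], bins=[17, 24, 28, 32, 36])
df = df.set_index('Age Grps')

# Group on the index level by its name; agg returns one row per age group.
summary = (df.groupby(level='Age Grps', observed=True)
             .agg({'Goals': 'mean', 'Assists': 'mean', 'Yellow Cards': 'sum'}))
print(summary)
```

Because summary has one row per age group, it should be possible to pass it straight to the plotting API the question mentions, for example summary[['Goals', 'Assists']].plot(kind='bar'), provided matplotlib is installed.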


python

numpy

matplotlib

pandas

pandas-groupby