Pandas 中的聚合

Aggregation in Pandas

  1. 如何使用 Pandas 执行聚合?
  2. 聚合后没有DataFrame!发生什么事了?
  3. 我如何主要聚合字符串列(lists、tuples、strings with separator)?
  4. 如何汇总计数?
  5. 如何创建一个由聚合值填充的新列?

我看到这些反复出现的问题询问 pandas 聚合功能的各个方面。 今天关于聚合及其各种用例的大部分信息都分散在数十个措辞糟糕、无法搜索的 post 中。 这里的目的是为了posterity整理一些比较重要的观点。

此问答是一系列有用的用户指南的下一部分:

请注意,此 post 不能替代 documentation about aggregation and about groupby,因此请同时阅读!

问题 1

如何使用 Pandas 执行聚合?

已扩展 aggregation documentation

聚合函数是减少返回对象维度的函数。这意味着输出 Series/DataFrame 与原始行相比具有更少或相同的行。

下面列出了一些常见的聚合函数:

Function    Description
mean()         Compute mean of groups
sum()         Compute sum of group values
size()         Compute group sizes
count()     Compute count of group
std()         Standard deviation of groups
var()         Compute variance of groups
sem()         Standard error of the mean of groups
describe()     Generates descriptive statistics
first()     Compute first of group values
last()         Compute last of group values
nth()         Take nth value, or a subset if n is a list
min()         Compute min of group values
max()         Compute max of group values
np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6),
                   'E' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D  E
0  foo    one  2  3  0
1  foo    two  4  1  0
2  bar  three  2  1  1
3  foo    two  1  0  3
4  bar    two  3  1  4
5  foo    one  2  1  0

按过滤列聚合 Cython implemented functions:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

groupby函数中没有指定的所有列都使用聚合函数,这里是A, B列:

df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

您也可以在groupby函数后仅指定列表中用于聚合的部分列:

df3 = df.groupby(['A', 'B'], as_index=False)['C','D'].sum()
print (df3)
     A      B  C  D
0  bar  three  2  1
1  bar    two  3  1
2  foo    one  4  4
3  foo    two  5  1

使用函数 DataFrameGroupBy.agg:

的结果相同
df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

对于应用于一列的多个函数,使用 tuples 的列表 - 新列和聚合函数的名称:

df4 = (df.groupby(['A', 'B'])['C']
         .agg([('average','mean'),('total','sum')])
         .reset_index())
print (df4)
     A      B  average  total
0  bar  three      2.0      2
1  bar    two      3.0      3
2  foo    one      2.0      4
3  foo    two      2.5      5

如果想传递多个函数是可能的传递 list of tuples:

df5 = (df.groupby(['A', 'B'])
         .agg([('average','mean'),('total','sum')]))

print (df5)
                C             D             E
          average total average total average total
A   B
bar three     2.0     2     1.0     1     1.0     1
    two       3.0     3     1.0     1     4.0     4
foo one       2.0     4     2.0     4     0.0     0
    two       2.5     5     0.5     1     1.5     3

然后得到MultiIndex列:

print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

为了转换为列,展平 MultiIndex 使用 mapjoin:

df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

另一个解决方案是传递聚合函数列表,然后展平 MultiIndex 并且对于另一个列名称使用 str.replace:

df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

df5.columns = (df5.columns.map('_'.join)
                  .str.replace('sum','total')
                  .str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

如果要分别指定聚合函数的每一列传dictionary:

df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D':'mean'})
         .rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
     A      B  C_total  D_average
0  bar  three        2        1.0
1  bar    two        3        1.0
2  foo    one        4        2.0
3  foo    two        5        0.5

你也可以传递自定义函数:

def func(x):
    return x.iat[0] + x.iat[-1]

df7 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D': func})
         .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
     A      B  C_total  D_sum_first_and_last
0  bar  three        2                     2
1  bar    two        3                     2
2  foo    one        4                     4
3  foo    two        5                     1

问题 2

聚合后没有DataFrame!发生什么事了?

两列或更多列的聚合:

df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A    B
bar  three    2
     two      3
foo  one      4
     two      5
Name: C, dtype: int32

首先检查Pandas对象的Indextype

print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
           names=['A', 'B'])

print (type(df1))
<class 'pandas.core.series.Series'>

有两种方法可以得到MultiIndex Series列:

  • 添加参数as_index=False
df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5
df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

如果按一列分组:

df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar    5
foo    9
Name: C, dtype: int32

... 得到 SeriesIndex:

print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')

print (type(df2))
<class 'pandas.core.series.Series'>

解决方案与MultiIndex Series相同:

df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
     A  C
0  bar  5
1  foo  9

df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
     A  C
0  bar  5
1  foo  9

问题 3

如何聚合主要的字符串列(lists、tuples、strings with separator)?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                   'D' : [1,2,3,2,3,1,2]})
print (df)
   A      B      C  D
0  a    one  three  1
1  c    two    one  2
2  b  three    two  3
3  b    two    two  2
4  a    two  three  3
5  c    one    two  1
6  b  three    one  2

可以通过 listtupleset 来代替聚合函数来转换列:

df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

另一种方法是使用 GroupBy.apply:

df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

要转换为带分隔符的字符串,仅当它是字符串列时才使用 .join

df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
   A                B
0  a          one,two
1  b  three,two,three
2  c          two,one

如果它是数字列,使用带有 astype 的 lambda 函数转换为 strings:

df3 = (df.groupby('A')['D']
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

另一个解决方案是在groupby之前转换为字符串:

df3 = (df.assign(D = df['D'].astype(str))
         .groupby('A')['D']
         .agg(','.join).reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

要转换所有列,请不要在 groupby 之后传递列列表。 没有任何列 D,因为 automatic exclusion of 'nuisance' columns。这意味着排除所有数字列。

df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
   A                B            C
0  a          one,two  three,three
1  b  three,two,three  two,two,one
2  c          two,one      one,two

所以需要把所有的列都转成字符串,然后得到所有的列:

df5 = (df.groupby('A')
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df5)
   A                B            C      D
0  a          one,two  three,three    1,3
1  b  three,two,three  two,two,one  3,2,2
2  c          two,one      one,two    2,1

问题 4

如何汇总计数?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                   'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
   A      B      C    D
0  a    one  three  NaN
1  c    two    NaN  2.0
2  b  three    NaN  3.0
3  b    two    two  2.0
4  a    two  three  3.0
5  c    one    two  NaN
6  b  three    one  2.0

每组 size 的函数 GroupBy.size

df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
   A  COUNT
0  a      2
1  b      3
2  c      2

函数GroupBy.count排除缺失值:

df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
   A  COUNT
0  a      2
1  b      2
2  c      1

这个函数应该用于多列计算非缺失值:

df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
   A  B_COUNT  C_COUNT  D_COUNT
0  a        2        2        1
1  b        3        2        3
2  c        2        1        1

一个相关的函数是Series.value_counts。它 returns 包含按降序排列的唯一值计数的对象的大小,因此第一个元素是最常出现的元素。它默认排除 NaNs 个值。

df4 = (df['A'].value_counts()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df4)
   A  COUNT
0  b      3
1  a      2
2  c      2

如果您想要与使用函数 groupby + size 相同的输出,请添加 Series.sort_index:

df5 = (df['A'].value_counts()
              .sort_index()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df5)
   A  COUNT
0  a      2
1  b      3
2  c      2

问题 5

如何创建一个由聚合值填充的新列?

方法 GroupBy.transform returns 一个与被分组对象索引相同(大小相同)的对象。

有关详细信息,请参阅 the Pandas documentation

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                    'B' : ['one', 'two', 'three','two', 'two', 'one'],
                    'C' : np.random.randint(5, size=6),
                    'D' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D
0  foo    one  2  3
1  foo    two  4  1
2  bar  three  2  1
3  foo    two  1  0
4  bar    two  3  1
5  foo    one  2  1


df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')


df[['C3','D3']] = df.groupby('A')['C','D'].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])['C','D'].transform('sum')

print (df)

     A      B  C  D  C1  C2  C3  D3  C4  D4
0  foo    one  2  3   9   4   9   5   4   4
1  foo    two  4  1   9   5   9   5   5   1
2  bar  three  2  1   5   2   5   2   2   1
3  foo    two  1  0   9   5   9   5   5   1
4  bar    two  3  1   5   3   5   2   3   1
5  foo    one  2  1   9   4   9   5   4   4

If you are coming from an R or SQL background, here are three examples that will teach you everything you need to do aggregation the way you are already familiar with:

Let us first create a Pandas dataframe

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'key2' : ['c','c','d','d','e'],
                   'value1' : [1,2,2,3,3],
                   'value2' : [9,8,7,6,5]})

df.head(5)

Here is how the table we created looks like:

key1 key2 value1 value2
a c 1 9
a c 2 8
a d 2 7
b d 3 6
a e 3 5

1. Aggregating With Row Reduction Similar to SQL Group By

1.1 If Pandas version >=0.25

Check your Pandas version by 运行ning print(pd.__version__). If your Pandas version is 0.25 or above then the following code will work:

df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'),
                                         sum_of_value_2=('value2', 'sum'),
                                         count_of_value1=('value1','size')
                                         ).reset_index()


df_agg.head(5)

The resulting data table will look like this:

key1 key2 mean_of_value1 sum_of_value2 count_of_value1
a c 1.5 17 2
a d 2.0 7 1
a e 3.0 5 1
b d 3.0 6 1

The SQL equivalent of this is:

SELECT
      key1
     ,key2
     ,AVG(value1) AS mean_of_value_1
     ,SUM(value2) AS sum_of_value_2
     ,COUNT(*) AS count_of_value1
FROM
    df
GROUP BY
     key1
    ,key2

1.2 If Pandas version <0.25

If your Pandas version is older than 0.25 then 运行ning the above code will give you the following error:

TypeError: aggregate() missing 1 required positional argument: 'arg'

Now to do the aggregation for both value1 and value2, you will 运行 this code:

df_agg = df.groupby(['key1','key2'],as_index=False).agg({'value1':['mean','count'],'value2':'sum'})

df_agg.columns = ['_'.join(col).strip() for col in df_agg.columns.values]

df_agg.head(5)

The resulting table will look like this:

key1 key2 value1_mean value1_count value2_sum
a c 1.5 2 17
a d 2.0 1 7
a e 3.0 1 5
b d 3.0 1 6

Renaming the columns needs to be done separately using the below code:

df_agg.rename(columns={"value1_mean" : "mean_of_value1",
                       "value1_count" : "count_of_value1",
                       "value2_sum" : "sum_of_value2"
                       }, inplace=True)

2. Create a Column Without Reduction in Rows (EXCEL - SUMIF, COUNTIF)

If you want to do a SUMIF, COUNTIF, etc., like how you would do in Excel where there is no reduction in rows, then you need to do this instead.

df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')

df.head(5)

The resulting data frame will look like this with the same number of rows as the original:

key1 key2 value1 value2 Total_of_value1_by_key1
a c 1 9 8
a c 2 8 8
a d 2 7 8
b d 3 6 3
a e 3 5 8

3. Creating a RANK Column ROW_NUMBER() OVER (PARTITION BY ORDER BY)

Finally, there might be cases where you want to create a rank column which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).

Here is how you do that.

 df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
              .groupby(['key1']) \
              .cumcount() + 1

 df.head(5)

Note: we make the code multi-line by adding \ at the end of each line.

Here is how the resulting data frame looks like:

key1 key2 value1 value2 RN
a c 1 9 4
a c 2 8 3
a d 2 7 2
b d 3 6 1
a e 3 5 1

In all the examples above, the final data table will have a table structure and won't have the pivot structure that you might get in other syntaxes.

Other aggregating operators:

mean() Compute mean of groups

sum() Compute sum of group values

size() Compute group sizes

count() Compute count of group

std() Standard deviation of groups

var() Compute variance of groups

sem() Standard error of the mean of groups

describe() Generates descriptive statistics

first() Compute first of group values

last() Compute last of group values

nth() Take nth value, or a subset if n is a list

min() Compute min of group values

max() Compute max of group values