How can I assign the result of a filtered, grouped aggregation as a new column in the original Pandas DataFrame?

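I am having trouble making the transition from R's data.table to Pandas for data munging.

Specifically, I am trying to assign the results of aggregations back into the original df as new columns. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.

To illustrate, I'm trying to replicate what I would do in R with: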

library(data.table)

df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))

df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
                        avg_id = mean(id) ),
    .(node, term)]

df

# id term node hours price vwap_ish avg_id
# 1:  1  qtr    A   300   107    90600      2
# 2:  2  qtr    A   300   104    90600      2
# 3:  3  qtr    A   300    91    90600      2
# 4:  4  qtr    B   300    89    95400      5
# 5:  5  qtr    B   300   113    95400      5
# 6:  6  qtr    B   300   116    95400      5
# 7:  7  mth    A    50   110       NA     NA
# 8:  8  mth    A   100   119       NA     NA
# 9:  9  mth    A   150    99       NA     NA
# 10: 10  mth    B    50   111       NA     NA
# 11: 11  mth    B   100   106       NA     NA
# 12: 12  mth    B   150   108       NA     NA

With Pandas, I can create an object from df that contains all the rows of the original df, with the aggregations:

import io
import numpy as np
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")

df = pd.read_csv(data)

df1 = df.groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df1

"""
              id term node  hours  price  vwap_ish  avg_id
node term                                                 
A    mth  6    7  mth    A     50    110     32250     8.0
          7    8  mth    A    100    119     32250     8.0
          8    9  mth    A    150     99     32250     8.0
     qtr  0    1  qtr    A    300    107     90600     2.0
          1    2  qtr    A    300    104     90600     2.0
          2    3  qtr    A    300     91     90600     2.0
B    mth  9   10  mth    B     50    111     32350    11.0
          10  11  mth    B    100    106     32350    11.0
          11  12  mth    B    150    108     32350    11.0
     qtr  3    4  qtr    B    300     89     95400     5.0
          4    5  qtr    B    300    113     95400     5.0
          5    6  qtr    B    300    116     95400     5.0
"""

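This doesn't really get me what I want, because a) it re-orders the rows and creates index levels, and b) it has calculated the aggregation for all rows.

I can get the subset easily enough with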


df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df2

"""
             id term node  hours  price  vwap_ish  avg_id
node term                                                
A    qtr  0   1  qtr    A    300    107     90600     2.0
          1   2  qtr    A    300    104     90600     2.0
          2   3  qtr    A    300     91     90600     2.0
B    qtr  3   4  qtr    B    300     89     95400     5.0
          4   5  qtr    B    300    113     95400     5.0
          5   6  qtr    B    300    116     95400     5.0
"""

but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.

I've tried:

df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

And also a few variations of .merge and .join. For example:

df.merge(df2, how='left')

ValueError: 'term' is both an index level and a column label, which is ambiguous.

df.merge(df2, how='left', on=df.columns)

KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')

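While writing this, I realised I could take my first approach and then just blank out the rows that are not 'qtr':

df1.loc[df1.term != 'qtr', ['vwap_ish','avg_id']] = np.nan

But this seems quite hacky. It means I have to use new columns rather than overwriting existing ones on the filtered rows, and if the aggregation function were to break, say when term == 'mth', that would be a problem too.

I'd really appreciate any help, as moving from data.table to Pandas has been a very steep learning curve, and there are a lot of things I would do in a one-liner that are taking me hours to figure out.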

You can add the group_keys=False parameter to avoid creating the MultiIndex, so that the left join works:

df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
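
One thing to be aware of: with no `on` argument, `merge` joins on every column the two frames share, which here means all five original columns. As a rough sketch (the name `new_cols` and the choice of `id` as the join key are my own assumptions, relying on `id` being unique per row), you could keep only the key plus the two new columns and let `validate` confirm that the join is one-to-one:

```python
import io
import pandas as pd

CSV = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""

df = pd.read_csv(io.StringIO(CSV))

# Select the non-key columns explicitly so the lambda receives the same
# columns whether or not this pandas version passes grouping columns to apply.
per_row = (
    df[df["term"] == "qtr"]
    .groupby(["node", "term"], group_keys=False)[["id", "hours", "price"]]
    .apply(lambda gp: gp.assign(
        vwap_ish=(gp["hours"] * gp["price"]).sum(),
        avg_id=gp["id"].mean(),
    ))
)

# Keep only the join key (assumed unique) and the two new columns.
new_cols = per_row[["id", "vwap_ish", "avg_id"]]

# validate raises MergeError if `id` turns out not to be unique on either side.
merged = df.merge(new_cols, on="id", how="left", validate="one_to_one")
print(merged)
```

The 'mth' rows have no match in `new_cols`, so they come back as NaN in both new columns.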

A solution without the left join:

m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
                                        .apply(lambda gp: gp.assign(
                                                     vwap_ish = (gp.hours * gp.price).sum(),
                                                     avg_id = np.mean(gp.id)
                                                      )
                                               ))
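
The `.loc` assignment works because pandas aligns the right-hand side to the target by index labels and column names, not by position. Here is a small sketch of that behaviour under my own assumptions: `group_stats` is a hypothetical helper that returns only the two new columns, and the target columns are created up front as NaN so the assignment only overwrites existing columns.

```python
import io
import numpy as np
import pandas as pd

CSV = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""

df = pd.read_csv(io.StringIO(CSV))


def group_stats(gp):
    # Hypothetical helper: one output row per input row, each carrying the
    # aggregates of its group and keeping the group's original index labels.
    return pd.DataFrame(
        {
            "vwap_ish": (gp["hours"] * gp["price"]).sum(),
            "avg_id": gp["id"].mean(),
        },
        index=gp.index,
    )


m = df["term"] == "qtr"
new_vals = (
    df[m]
    .groupby(["node", "term"], group_keys=False)[["id", "hours", "price"]]
    .apply(group_stats)
)

# Create the target columns first, filled with NaN.
df = df.assign(vwap_ish=np.nan, avg_id=np.nan)

# Reverse the rows on purpose: the labels still match, so each value
# lands on the row it belongs to.
df.loc[m, ["vwap_ish", "avg_id"]] = new_vals.iloc[::-1]
print(df)
```

Because the match is on labels, this only behaves as intended while the filtered frame keeps the original index; resetting the index before the assignment would break the alignment.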

An improved solution uses named aggregation and creates the vwap_ish column before the groupby, which can improve performance:

df2 = (df[df.term == 'qtr']
         .assign(vwap_ish = lambda x: x.hours * x.price)
         .groupby(['node','term'], as_index=False)
         .agg(vwap_ish=('vwap_ish','sum'),
              avg_id=('id','mean')))

df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
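
It may help to see that the intermediate `df2` in this version holds one row per (node, term) group rather than one row per original row; the left join is what broadcasts the values back. In the sketch below the `on` and `validate` arguments are my additions, not part of the solution above, and `out` is just an illustrative name:

```python
import io
import pandas as pd

CSV = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""

df = pd.read_csv(io.StringIO(CSV))

df2 = (
    df[df["term"] == "qtr"]
    .assign(vwap_ish=lambda x: x["hours"] * x["price"])
    .groupby(["node", "term"], as_index=False)
    .agg(vwap_ish=("vwap_ish", "sum"), avg_id=("id", "mean"))
)
# df2 has two rows: (A, qtr, 90600, 2.0) and (B, qtr, 95400, 5.0)
print(df2)

# Name the join keys explicitly; many_to_one checks that each
# (node, term) pair appears at most once in df2.
out = df.merge(df2, on=["node", "term"], how="left", validate="many_to_one")
print(out)
```

Spelling out the keys also keeps the merge from silently joining on extra columns if `df` already contains `vwap_ish` or `avg_id` from an earlier run.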

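If you would rather avoid apply (for example, if you care about performance), one option is to break the work into separate steps.

Compute the product of hours and price before grouping: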

temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)

Filter on term, then build the groupby object:

temp = (temp
        .loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
        .groupby([df.node, df.term])
        )

Use transform to assign the aggregated values back; pandas takes care of the index alignment:

(df
.assign(vwap_ish = temp.vwap_ish.transform('sum'), 
        avg_id = temp.avg_id.transform('mean'))
)

    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
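
The three steps above can be folded into one small helper. This is only a sketch: the function name `add_group_stats`, its `term` parameter, and the temporary column `hours_x_price` are all hypothetical. It relies on `assign` aligning the shorter transformed Series back to the full frame by index, which leaves NaN on the rows outside the filter.

```python
import io
import pandas as pd

CSV = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""

df = pd.read_csv(io.StringIO(CSV))


def add_group_stats(frame, term="qtr"):
    """Hypothetical helper: add vwap_ish and avg_id for rows matching `term`."""
    mask = frame["term"].eq(term)
    grouped = (
        frame.loc[mask]
        # Row-wise product first, so a plain 'sum' is enough afterwards.
        .assign(hours_x_price=lambda x: x["hours"] * x["price"])
        .groupby(["node", "term"])
    )
    # transform returns one value per filtered row with the original index;
    # assign aligns on that index, so unfiltered rows become NaN.
    return frame.assign(
        vwap_ish=grouped["hours_x_price"].transform("sum"),
        avg_id=grouped["id"].transform("mean"),
    )


result = add_group_stats(df)
print(result)
```

`assign` returns a new frame, so `df` itself is left unchanged; calling `add_group_stats(df, term="mth")` would fill the 'mth' rows instead.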

This is just an aside, and you can ignore it entirely: pydatatable attempts to mimic R's data.table as closely as it can. Here is one solution with pydatatable:

from datatable import dt, f, by, ifelse, update

DT = dt.Frame(df)

query = f.term == 'qtr'

# mask the non-'qtr' rows first, then aggregate within each group
agg = {'vwap_ish': ifelse(query, (f.hours * f.price), np.nan).sum(), 
       'avg_id' : ifelse(query, f.id, np.nan).mean()}

# update is a near equivalent to `:=`
DT[:, update(**agg), by('node', 'term')]

DT

   |    id  term   node   hours  price  vwap_ish   avg_id
   | int64  str32  str32  int64  int64   float64  float64
-- + -----  -----  -----  -----  -----  --------  -------
 0 |     1  qtr    A        300    107     90600        2
 1 |     2  qtr    A        300    104     90600        2
 2 |     3  qtr    A        300     91     90600        2
 3 |     4  qtr    B        300     89     95400        5
 4 |     5  qtr    B        300    113     95400        5
 5 |     6  qtr    B        300    116     95400        5
 6 |     7  mth    A         50    110        NA       NA
 7 |     8  mth    A        100    119        NA       NA
 8 |     9  mth    A        150     99        NA       NA
 9 |    10  mth    B         50    111        NA       NA
10 |    11  mth    B        100    106        NA       NA
11 |    12  mth    B        150    108        NA       NA
[12 rows x 7 columns]