How can I assign the result of a filtered, grouped aggregation as a new column in the original Pandas DataFrame
I'm having trouble transitioning from R's data.table to Pandas for my data munging.
Specifically, I'm trying to assign the result of an aggregation back to the original df as a new column. Note that the aggregation is a function of two columns, so I don't think df.transform()
is the right approach.
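For a single-column aggregate I can see how transform would work; a minimal sketch (using the column names and group keys from the sample data below):
df['avg_id'] = df.groupby(['node', 'term'])['id'].transform('mean')
What I don't see is how to feed both hours and price into a single transform call like that.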
To illustrate, here's what I'm trying to replicate from R:
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
avg_id = mean(id) ),
.(node, term)]
df
# id term node hours price vwap_ish avg_id
# 1: 1 qtr A 300 107 90600 2
# 2: 2 qtr A 300 104 90600 2
# 3: 3 qtr A 300 91 90600 2
# 4: 4 qtr B 300 89 95400 5
# 5: 5 qtr B 300 113 95400 5
# 6: 6 qtr B 300 116 95400 5
# 7: 7 mth A 50 110 NA NA
# 8: 8 mth A 100 119 NA NA
# 9: 9 mth A 150 99 NA NA
# 10: 10 mth B 50 111 NA NA
# 11: 11 mth B 100 106 NA NA
# 12: 12 mth B 150 108 NA NA
Using Pandas, I can create an object that contains all the rows of the original df together with the aggregates:
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df1
"""
id term node hours price vwap_ish avg_id
node term
B mth 9 10 mth B 50 111 32350 10.0
10 11 mth B 100 106 32350 10.0
11 12 mth B 150 108 32350 10.0
qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
A mth 6 7 mth A 50 110 32250 7.0
7 8 mth A 100 119 32250 7.0
8 9 mth A 150 99 32250 7.0
qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
"""
This doesn't really get me what I want because a) it reorders the rows and adds an index, and b) it computes the aggregates over all rows.
I can get just the subset easily enough:
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df2
"""
id term node hours price vwap_ish avg_id
node term
A qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
B qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
I've also tried some variations of .merge and .join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
While writing this up, I realised I could just take my first approach and then do
df.loc[df.term != 'qtr', ['vwap_ish','avg_id']] = np.nan
but that seems pretty hacky. It means I have to add new columns rather than overwrite existing ones on the filtered rows, and if the aggregation were to break, say for the term='mth' rows, that would be a problem too.
I'd really appreciate any help here, as moving from data.table to Pandas has been a very steep learning curve; a lot of things I would do as one-liners have taken me hours to figure out.
You can add the group_keys=False
parameter to drop the MultiIndex, so the left join then works fine:
df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
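As an aside, merge without on= joins on every column name the two frames share, which is why no key list is needed above; spelling the keys out explicitly (all shared columns from the sample data) should be equivalent:
df = df.merge(df2, how='left', on=['id', 'term', 'node', 'hours', 'price'])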
A solution without a left join:
m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
.apply(lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
))
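This works because .loc aligns the right-hand frame on both the row index and the listed column names, so only vwap_ish and avg_id are taken from it. If you prefer to make that explicit, a variant that selects the two columns before assigning (agg is just a throwaway name for the sketch) should behave the same:
agg = (df[m].groupby(['node','term'], group_keys=False)
            .apply(lambda gp: gp.assign(vwap_ish = (gp.hours * gp.price).sum(),
                                        avg_id = np.mean(gp.id))))
df.loc[m, ['vwap_ish','avg_id']] = agg[['vwap_ish','avg_id']]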
An improved solution uses named aggregation and creates the vwap_ish
column before the groupby
, which gives better performance:
df2 = (df[df.term == 'qtr']
.assign(vwap_ish = lambda x: x.hours * x.price)
.groupby(['node','term'], as_index=False)
.agg(vwap_ish=('vwap_ish','sum'),
avg_id=('id','mean')))
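# For reference, df2 at this point should hold one row per (node, term) group
# of the 'qtr' subset; with the sample data above that works out to:
#   node term  vwap_ish  avg_id
# 0    A  qtr     90600     2.0
# 1    B  qtr     95400     5.0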
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
If you'd like to avoid apply
(say you're keen on performance), one option is to break it into individual steps:
Compute the product of hours and price before grouping:
temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)
Get a groupby object after filtering on term
:
temp = (temp
.loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
.groupby([df.node, df.term])
)
Assign the aggregated values back with transform
; pandas will take care of the index alignment:
(df
.assign(vwap_ish = temp.vwap_ish.transform('sum'),
avg_id = temp.avg_id.transform('mean'))
)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
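If you like, the three steps can also be collapsed into one chained assign; the following is just a sketch that reuses the same index-alignment trick (the names m, keys and out are arbitrary), so treat it as untested:
m = df.term.eq('qtr')
keys = [df.node, df.term]
out = df.assign(
    vwap_ish = (df.hours * df.price)[m].groupby(keys).transform('sum'),
    avg_id   = df.loc[m, 'id'].groupby(keys).transform('mean')
)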
This is just an aside, which you can completely ignore - pydatatable attempts to mimic R's data.table as much as it can. This is one solution with pydatatable:
from datatable import dt, f, by, ifelse, update
DT = dt.Frame(df)
query = f.term == 'qtr'
agg = {'vwap_ish': ifelse(query, (f.hours * f.price), np.nan).sum(),
'avg_id' : ifelse(query, f.id.mean(), np.nan).sum()}
# update is a near equivalent to `:=`
DT[:, update(**agg), by('node', 'term')]
DT
| id term node hours price vwap_ish avg_id
| int64 str32 str32 int64 int64 float64 float64
-- + ----- ----- ----- ----- ----- -------- -------
0 | 1 qtr A 300 107 90600 6
1 | 2 qtr A 300 104 90600 6
2 | 3 qtr A 300 91 90600 6
3 | 4 qtr B 300 89 95400 15
4 | 5 qtr B 300 113 95400 15
5 | 6 qtr B 300 116 95400 15
6 | 7 mth A 50 110 NA NA
7 | 8 mth A 100 119 NA NA
8 | 9 mth A 150 99 NA NA
9 | 10 mth B 50 111 NA NA
10 | 11 mth B 100 106 NA NA
11 | 12 mth B 150 108 NA NA
[12 rows x 7 columns]