nanmean with weights to calculate weighted average in pandas .agg
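I am using a lambda function inside a pandas aggregation to calculate a weighted average. My problem is that if one of the values is NaN, the result for the whole group is NaN. How can I avoid this?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])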
df['four'] = 'bar'
df['five'] = df['one'] > 0
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df.loc['b','four'] ='foo'
df.loc['c','four'] ='foo'
        one       two     three four   five found
a  1.046540 -0.304646 -0.982008  bar   True   NaN
b       NaN       NaN       NaN  foo    NaN   foo
c -1.086525  1.086501  0.403910  foo  False   NaN
d       NaN       NaN       NaN  NaN    NaN   NaN
e  0.569420  0.105422  0.192559  bar   True   NaN
f  0.384400 -0.558321  0.324624  bar   True   NaN
g       NaN       NaN       NaN  NaN    NaN   NaN
h  0.656231 -2.185062  0.180535  bar   True   NaN
df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', lambda x: np.average(x, weights=df.loc[x.index, 'two'])))
           sum  weighted_avg
four
bar  -2.942608      0.648173
foo   1.086501           NaN
Desired result:
           sum  weighted_avg
four
bar  -2.942608      0.648173
foo   1.086501     -1.086525
Unlike this question, the problem here is not that the actual column values fail to appear; it is that nanmean has no option for weights.
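To make that gap concrete, here is a small sketch with made-up numbers (the arrays `x` and `w` are not from the question). It shows that `np.average` propagates NaN, that `np.nanmean` skips NaN but takes no weights, and that masking both arrays by hand gives the weighted result.

```python
import numpy as np

# Illustrative values and weights; the first value is missing.
x = np.array([np.nan, 2.0, 4.0])
w = np.array([1.0, 1.0, 3.0])

# np.average accepts weights but a single NaN poisons the result.
plain = np.average(x, weights=w)

# np.nanmean ignores the NaN but cannot take weights.
unweighted = np.nanmean(x)

# Dropping the NaN entries from both arrays gives a weighted nan-aware mean.
mask = ~np.isnan(x)
filtered = np.average(x[mask], weights=w[mask])

print(plain, unweighted, filtered)
```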
Another numerical example:
      x      y
0   NaN   18.0
1   NaN   21.0
2   NaN   38.0
3  56.0  150.0
4  65.0  154.0
Here we only want to return the weighted average of the last two rows and ignore the other rows, which contain NaN.
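As a quick check of this second example, here is a minimal sketch. It assumes that `y` holds the weights for `x`, which the question does not state explicitly, and the frame name `example` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the table above; y is assumed to be the weight column.
example = pd.DataFrame({
    "x": [np.nan, np.nan, np.nan, 56.0, 65.0],
    "y": [18.0, 21.0, 38.0, 150.0, 154.0],
})

# Keep only the rows where the value being averaged is present.
mask = example["x"].notna()

# Weighted average over the remaining rows: (56*150 + 65*154) / (150 + 154)
result = np.average(example.loc[mask, "x"], weights=example.loc[mask, "y"])
print(result)
```

Only rows 3 and 4 contribute, so the first three weights never enter the calculation.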
This solution worked for me:
def f(x):
    # keep only the rows of this group whose value is not NaN
    indices = ~np.isnan(x)
    return np.average(x[indices], weights=df.loc[x.index[indices], 'two'])
df = df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', f))
print(df)
           sum  weighted_avg
four
bar  -2.942607      0.648173
foo   1.086501     -1.086525
Edit:
def f(x):
    indices = ~np.isnan(x)
    # average only when the group contains no NaN at all; otherwise return NaN
    if indices.all():
        return np.average(x[indices], weights=df.loc[x.index[indices], 'two'])
    else:
        return np.nan
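This seems more robust:
weight_column = 'two'  # column holding the weights

def f(x):
    # keep rows where both the value and its weight are present
    indices = (~np.isnan(x)) & (~np.isnan(df[weight_column]))[x.index]
    try:
        return np.average(x[indices], weights=df.loc[x.index[indices], weight_column])
    except ZeroDivisionError:
        # np.average raises this when the weights sum to zero, e.g. no valid rows
        return np.nan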
df = df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', f))
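The functions above read the global `df`, which the last line then overwrites. One way to avoid that dependency is a small factory that closes over the frame and the weight column. This is only a sketch: `make_weighted_nanmean` is a hypothetical helper, not part of pandas or NumPy, the sample data is invented, and it assumes the frame has a unique index.

```python
import numpy as np
import pandas as pd


def make_weighted_nanmean(frame, weight_column):
    """Build an aggregator that skips rows where the value or the weight is NaN."""
    def weighted_nanmean(x):
        # Look up the weights for exactly the rows in this group.
        w = frame.loc[x.index, weight_column]
        valid = x.notna() & w.notna()
        # No usable rows, or weights summing to zero: nothing to average.
        if not valid.any() or w[valid].sum() == 0:
            return np.nan
        return np.average(x[valid], weights=w[valid])
    return weighted_nanmean


# Invented sample data: 'foo' has one missing value, 'baz' has only missing values.
data = pd.DataFrame({
    "group": ["bar", "bar", "foo", "foo", "baz"],
    "value": [1.0, 3.0, np.nan, 2.0, np.nan],
    "weight": [1.0, 3.0, 5.0, 4.0, 1.0],
})

out = data.groupby("group").agg(
    total=("weight", "sum"),
    weighted_avg=("value", make_weighted_nanmean(data, "weight")),
)
print(out)
```

Checking the weight sum up front plays the same role as catching `ZeroDivisionError`, and the result is kept in a new name so the source frame stays available.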