Get weighted average summary data column in new pandas dataframe from existing dataframe based on other column-ID

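This is somewhat similar to a question I asked here earlier. However, instead of only calculating the sum of the data points, I want to get a weighted average in an extra column. I'll repeat and rephrase the question: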

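I want to summarize the data in one dataframe and add the result as new columns to another dataframe. My data contains apartments with an ID number, and for each room in an apartment it holds the surface and the U-value. What I want is a dataframe that summarizes this and gives me, per apartment, the total surface and the surface-weighted average U-value. Three conditions apply to the original dataframe; the comments on the desired output below show how each case should be handled.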

Initial dataframe 'data':

print(data)
    ID  Surface  U-value
0    2     10.0      1.0
1    2     12.0      1.0
2    2     24.0      0.5
3    2      8.0      1.0
4    4     84.0      0.8
5    4     84.0      0.8
6    4     84.0      0.8
7   52      NaN      0.2
8   52     96.0      1.0
9   95      8.0      2.0
10  95      6.0      2.0
11  95     12.0      2.0
12  95     30.0      1.0
13  95     12.0      1.5

Desired output for 'df':

print(df)

    ID  Surface  U-value  #-> U-value = surface-weighted U-value; Surface = sum of all surfaces, except when all surfaces per ID are the same (example: ID 4)
0    2     54.0   0.777
1    4     84.0   0.8     #-> as the surfaces are the same for every row of this ID in the original data, they are not summed; only one of the rows is passed on (see the second condition)
2   52     96.0   1.0     #-> as one of the 2 surfaces is empty, the U-value on that row is ignored, so the output is the weighted average of the rows that have both a 'Surface' and a 'U-value' (in this case 1.0)
3   95     68.0   1.47

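jezrael's code from the referenced question already works well for sum(), but how do I add a weighted-average 'U-value' column to it? I really have no idea. One could use mean() instead of sum() to get the plain average, but a weighted average?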

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})

data = pd.DataFrame({"ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5]})
print(data)

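cols = ['Surface']
# m1: True on every row of an ID whose (non-missing) values are all identical
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True on every repeat of an (ID, value) pair after its first occurrence
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

# blank out the repeats of the "all identical" IDs, then sum per ID
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)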
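
For readers who have not seen the referenced code, here is a minimal sketch of what the two masks do, written for the single 'Surface' column only. The names all_same, repeated and surface are illustrative and not part of the original code. Note that nunique ignores missing values, so ID 52 (one real surface plus one empty cell) also counts as "all identical"; that is harmless here because no surface is repeated for that ID.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
})

# Illustrative name: True on every row of an ID whose non-missing
# surfaces are all identical (here ID 4 and ID 52).
all_same = data.groupby("ID")["Surface"].transform("nunique").eq(1)

# Illustrative name: True on every repeat of an (ID, Surface) pair
# after its first occurrence.
repeated = data.duplicated(subset=["ID", "Surface"])

# Blank out the repeats only for the "all identical" IDs, then sum per ID.
# sum() skips missing values, so the blanked rows do not contribute.
surface = data["Surface"].mask(all_same & repeated).groupby(data["ID"]).sum()
print(surface)
```

Only the second and third rows of ID 4 are blanked; the repeated 12.0 for ID 95 is kept because that ID has several distinct surfaces.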

This should do the trick:

data.groupby('ID').apply(lambda g: (g['U-value']*g['Surface']).sum() / g['Surface'].sum())
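
The same quantity can also be computed without apply. The sketch below is one possible variant, not the method used in this answer. It first drops rows where either value is missing, so that a row with a Surface but no U-value cannot inflate the denominator. That is an assumption about the desired behaviour which goes slightly beyond the sample data, where only a Surface is missing. The names complete, weighted, total and u_value are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5],
})

# Assumption: only rows with both a Surface and a U-value should count.
complete = data.dropna(subset=["Surface", "U-value"])

# Numerator: sum of U-value * Surface per ID.
weighted = (complete["U-value"] * complete["Surface"]).groupby(complete["ID"]).sum()
# Denominator: sum of Surface per ID, over the same rows.
total = complete["Surface"].groupby(complete["ID"]).sum()

u_value = (weighted / total).rename("U-value")
print(u_value)
```

An ID whose rows are all incomplete would simply be absent from u_value, which may or may not be what is wanted.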

To add this to the original dataframe 'df', don't reset the index first, so that the new column aligns on ID:

df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum()
df['U-value'] = data.groupby('ID').apply(
    lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
df.reset_index(inplace=True)

Result:

   ID  Surface   U-value
0   2     54.0  0.777778
1   4     84.0  0.800000
2  52     96.0  1.000000
3  95     68.0  1.470588
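
As a sanity check, the U-values above can be reproduced with numpy.average, which accepts a weights argument. This is only a cross-check, under the assumption that rows with a missing Surface or U-value should be dropped before averaging; weighted_u is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5],
})

def weighted_u(group):
    """Hypothetical helper: surface-weighted U-value of one apartment."""
    # Assumption: incomplete rows are ignored.
    rows = group.dropna(subset=["Surface", "U-value"])
    return np.average(rows["U-value"], weights=rows["Surface"])

# Iterating over the groupby yields (ID, sub-dataframe) pairs in sorted ID order.
check = {key: weighted_u(group) for key, group in data.groupby("ID")}
for key, value in check.items():
    print(key, round(value, 6))
```

np.average raises an error if the weights sum to zero, so an apartment with no complete rows would need separate handling.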