一行中值的条件平均值,具体取决于数据限定符

Conditional average of values in a row, depending on data qualifiers

希望你们一切都好。

所以我一生都在与 Excel 打交道,现在我正在切换到 Python 和 Pandas。事实证明,学习曲线对我来说非常陡峭,所以请与我分享。

日子一天天好起来。我已经设法聚合值,input/ouput 来自 csv/excel,删除“na”值等等。然而,我现在偶然发现了一堵高墙让我爬上去......

我创建了我正在使用的数据框的摘录。你可以在这里下载它,这样你就可以理解我将要写的内容:https://filetransfer.io/data-package/pWE9L29S#link

df_example

t_stamp,1_wind,2_wind,3_wind,4_wind,5_wind,6_wind,7_wind,1_wind_Q,2_wind_Q,3_wind_Q,4_wind_Q,5_wind_Q,6_wind_Q,7_wind_Q
2021-06-06 18:20:00,12.14397093693768,12.14570426940918,10.97993184016605,11.16468568605988,9.961717914791588,10.34653735907099,11.6856901451427,True,False,True,True,True,True,True
2021-05-10 19:00:00,8.045154709031468,8.572511270557484,8.499070711427668,7.949358210396142,8.252115912454919,7.116505042782365,8.815732567915179,True,True,True,True,True,True,True
2021-05-27 22:20:00,8.38946901817802,6.713454777683985,7.269814675171176,7.141862659613969,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-05 18:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-06-06 12:20:00,11.95525872119988,12.14570426940918,12.26086164116684,12.89527716859738,11.77172234144684,12.12409015586662,12.52180822809299,True,False,True,True,True,True,True
2021-06-04 03:30:00,14.72553364088618,12.72900662616056,10.59386275508178,10.96070182287055,12.38239256540934,12.07846616943932,10.58384464064597,True,True,True,True,False,True,True
2021-05-05 13:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-05-24 18:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.60747853230812,17.18577813727543,17.70745523935796,False,False,False,False,True,True,True
2021-05-07 19:00:00,13.94341927008482,10.95456999345216,13.36533234604886,0.0,3.782910539990379,10.86996953698871,13.45072022532649,True,True,True,False,False,True,True
2021-05-13 00:40:00,10.70940582779898,10.22222264510213,9.043496015164536,9.03805802580422,11.53775481234347,10.09538681656049,10.19345618536208,True,True,True,True,True,True,True
2021-05-27 19:40:00,10.8317678500958,7.929683248532885,8.264301219025942,8.184133252794958,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-09 12:00:00,10.55571650269678,7.635778078425459,10.43683108425784,7.847532146733346,8.100127641989639,7.770247510198059,8.040702032061867,True,True,True,True,True,True,True
2021-05-19 19:00:00,2.322496225799398,2.193219010982461,2.301622604435732,2.204278609893358,2.285408405883714,1.813280858368885,1.667207419773053,True,True,True,True,True,True,True
2021-05-30 12:30:00,5.776450801637788,8.488826231951345,10.98525552709715,7.03016556196849,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-24 14:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.93466266883504,17.04697174496121,17.0739475214739,False,False,False,False,True,False,True

你在看什么:

"n"表示测点数

我想要达到的目标: 目标是创建一个名为“Avg_WS”的新列,它遍历每一行并计算以下内容:

示例:因此,如果在给定行中,“4_wind_Q”列为“False”,则应从该给定行的平均值中排除值“4_wind”。

额外:如果给定行中的所有限定符均为“假”,则“Avg_WS”应等于同一行中的“NaN”。

我尝试过使用 apply,但我不知道如何匹配值限定符对

太感谢你了!

我尝试使用 mask 为此。

quals = ['1_wind_Q','2_wind_Q','3_wind_Q','4_wind_Q','5_wind_Q','6_wind_Q','7_wind_Q']
fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']

df[fields].mask( ~df[quals].values ).mean( axis=1 )

# output

0     11.047089
1      8.178635
2      7.378650
3           NaN
4     12.254836
5     11.945236
6           NaN
7     17.500237
8     12.516802
9     10.119969
10     8.802471
11     8.626705
12     2.112502
13     8.070175
14    17.504305
dtype: float64

# assign this to the dataframe
df.loc[ :, 'Avg_WS' ] = df[fields].mask( ~df[quals].values ).mean( axis=1 )

mask 通过在每个“字段”上应用一个布尔掩码来工作 - 警告是布尔掩码必须与您尝试应用它的数据具有相同的形状(即必须具有相同的形状尺寸 n x m)

mean( axis= 1 ) 告诉数据框对每一行应用均值函数(而不是 axis=0 暗示的列。