Conditional average of values in a row, depending on data qualifiers
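Hope you are all doing well.
I have worked with Excel my whole life, and now I am switching to Python and Pandas. The learning curve has turned out to be quite steep for me, so please bear with me.
It gets better day by day. I have already managed to aggregate values, read input from and write output to csv/excel, drop "na" values, and so on. However, I have now run into a high wall to climb...
I created an excerpt of the dataframe I am working with. You can download it here so you can follow what I am about to write: https://filetransfer.io/data-package/pWE9L29S#link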
df_example
t_stamp,1_wind,2_wind,3_wind,4_wind,5_wind,6_wind,7_wind,1_wind_Q,2_wind_Q,3_wind_Q,4_wind_Q,5_wind_Q,6_wind_Q,7_wind_Q
2021-06-06 18:20:00,12.14397093693768,12.14570426940918,10.97993184016605,11.16468568605988,9.961717914791588,10.34653735907099,11.6856901451427,True,False,True,True,True,True,True
2021-05-10 19:00:00,8.045154709031468,8.572511270557484,8.499070711427668,7.949358210396142,8.252115912454919,7.116505042782365,8.815732567915179,True,True,True,True,True,True,True
2021-05-27 22:20:00,8.38946901817802,6.713454777683985,7.269814675171176,7.141862659613969,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-05 18:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-06-06 12:20:00,11.95525872119988,12.14570426940918,12.26086164116684,12.89527716859738,11.77172234144684,12.12409015586662,12.52180822809299,True,False,True,True,True,True,True
2021-06-04 03:30:00,14.72553364088618,12.72900662616056,10.59386275508178,10.96070182287055,12.38239256540934,12.07846616943932,10.58384464064597,True,True,True,True,False,True,True
2021-05-05 13:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-05-24 18:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.60747853230812,17.18577813727543,17.70745523935796,False,False,False,False,True,True,True
2021-05-07 19:00:00,13.94341927008482,10.95456999345216,13.36533234604886,0.0,3.782910539990379,10.86996953698871,13.45072022532649,True,True,True,False,False,True,True
2021-05-13 00:40:00,10.70940582779898,10.22222264510213,9.043496015164536,9.03805802580422,11.53775481234347,10.09538681656049,10.19345618536208,True,True,True,True,True,True,True
2021-05-27 19:40:00,10.8317678500958,7.929683248532885,8.264301219025942,8.184133252794958,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-09 12:00:00,10.55571650269678,7.635778078425459,10.43683108425784,7.847532146733346,8.100127641989639,7.770247510198059,8.040702032061867,True,True,True,True,True,True,True
2021-05-19 19:00:00,2.322496225799398,2.193219010982461,2.301622604435732,2.204278609893358,2.285408405883714,1.813280858368885,1.667207419773053,True,True,True,True,True,True,True
2021-05-30 12:30:00,5.776450801637788,8.488826231951345,10.98525552709715,7.03016556196849,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-24 14:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.93466266883504,17.04697174496121,17.0739475214739,False,False,False,False,True,False,True
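What you are looking at:
"n" denotes the number of measurement points.
- First column: timestamp of the values
- Column index 1 to "n": average wind speed over the past 10 minutes at the different points
- Column index "n+1" to the last one (-1): qualifier indicating whether the value at the corresponding point is valid (True) or invalid (False). So for the value "1_wind", the qualifier "1_wind_Q" applies.
What I want to achieve:
The goal is to create a new column named "Avg_WS" that goes through each row and computes the following:
- The average of the value range, counting only the values whose corresponding qualifier is True
Example: if, in a given row, the "4_wind_Q" column is False, then the value "4_wind" should be excluded from that row's average.
Extra: if all the qualifiers in a given row are False, then "Avg_WS" should be NaN for that row.
I have tried using apply, but I can't figure out how to match the value-qualifier pairs.
Thank you so much!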
I'd try using mask for this.
quals = ['1_wind_Q','2_wind_Q','3_wind_Q','4_wind_Q','5_wind_Q','6_wind_Q','7_wind_Q']
fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']
df[fields].mask( ~df[quals].values ).mean( axis=1 )
# output
0 11.047089
1 8.178635
2 7.378650
3 NaN
4 12.254836
5 11.945236
6 NaN
7 17.500237
8 12.516802
9 10.119969
10 8.802471
11 8.626705
12 2.112502
13 8.070175
14 17.504305
dtype: float64
# assign this to the dataframe
df.loc[ :, 'Avg_WS' ] = df[fields].mask( ~df[quals].values ).mean( axis=1 )
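mask works by applying a boolean mask to each of the fields. The caveat is that the boolean mask must have the same shape as the data you are applying it to (i.e. the same n x m dimensions). Passing .values hands mask a plain array, so the qualifiers are matched to the wind columns by position rather than by column label.
mean( axis=1 ) tells the dataframe to apply the mean function across each row (as opposed to down each column, which axis=0 implies). Masked values become NaN, which mean skips by default, so a row whose qualifiers are all False comes out as NaN.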
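To make the shape requirement concrete, here is a minimal sketch on a made-up two-point frame. The numbers are hypothetical and only the column naming follows the question. It uses to_numpy(), which for this purpose is interchangeable with .values.

```python
import pandas as pd

# Hypothetical two-point excerpt; only the naming scheme comes from the question.
df = pd.DataFrame({
    "1_wind": [10.0, 4.0, 0.0],
    "2_wind": [20.0, 6.0, 0.0],
    "1_wind_Q": [True, True, False],
    "2_wind_Q": [False, True, False],
})
fields = ["1_wind", "2_wind"]
quals = ["1_wind_Q", "2_wind_Q"]

# mask() hides entries where the condition is True, so invert the qualifiers:
# True now means "invalid, hide this value".
invalid = ~df[quals].to_numpy()

# The mask is applied position by position, so both blocks must be n x m.
assert invalid.shape == df[fields].shape

masked = df[fields].mask(invalid)
print(masked)
```

The order of quals has to mirror the order of fields, because position is the only thing pairing a value with its qualifier here.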
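Putting it together, the sketch below runs the whole approach on a small made-up frame (three points, three rows; the numbers are not from the linked file). As an assumption beyond the original answer, it derives the qualifier and field lists from the "_Q" suffix instead of typing them out, which may help when there are many measurement points.

```python
import pandas as pd

# Hypothetical data in the question's layout: timestamp, values, then qualifiers.
df = pd.DataFrame({
    "t_stamp": pd.to_datetime(
        ["2021-05-07 19:00:00", "2021-05-05 18:20:00", "2021-05-13 00:40:00"]
    ),
    "1_wind": [12.0, 0.0, 10.0],
    "2_wind": [0.0, 0.0, 11.0],
    "3_wind": [14.0, 0.0, 9.0],
    "1_wind_Q": [True, False, True],
    "2_wind_Q": [False, False, True],
    "3_wind_Q": [True, False, True],
})

# Pair each qualifier with its value column by name: "1_wind_Q" -> "1_wind".
quals = [c for c in df.columns if c.endswith("_Q")]
fields = [c[:-2] for c in quals]  # drop the trailing "_Q"

# Hide invalid values, then average what is left.
masked = df[fields].mask(~df[quals].to_numpy())

# axis=1: one mean per row. NaN is skipped by default, and a row with
# nothing left to average yields NaN.
df["Avg_WS"] = masked.mean(axis=1)

# axis=0, for contrast: one mean per column (per measurement point).
col_means = masked.mean(axis=0)

print(df[["t_stamp", "Avg_WS"]])
print(col_means)
```

In this made-up frame the first row averages only "1_wind" and "3_wind", the second row has no valid value and ends up NaN, and the third row averages all three points.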