How to find the source indexes of the rolling-window max for a DataFrame?
I have a DataFrame with a DatetimeIndex, and I want to find the maximum element for each window. But I also need to know the index of that element.
Example data:
data = pd.DataFrame(
    index=pd.date_range(start=pd.to_datetime('2010-10-10 12:00:00'),
                        periods=10, freq='H'),
    data={'value': [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]}
)
If I use a rolling max, I lose the indexes:
data.rolling(3).max()
Output:
value
2010-10-10 12:00:00 NaN
2010-10-10 13:00:00 NaN
2010-10-10 14:00:00 3.0
2010-10-10 15:00:00 2.0
2010-10-10 16:00:00 5.0
2010-10-10 17:00:00 5.0
2010-10-10 18:00:00 5.0
2010-10-10 19:00:00 1.0
2010-10-10 20:00:00 1.0
2010-10-10 21:00:00 1.0
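If I try argmax, I get each maximum's position within its window as an integer (but I need the source datetime index, or the integer position in the source DataFrame, so that I can look the rows up with iloc):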
data.rolling(3).apply(lambda x: x.argmax())
Output:
value
2010-10-10 12:00:00 NaN
2010-10-10 13:00:00 NaN
2010-10-10 14:00:00 0.0
2010-10-10 15:00:00 0.0
2010-10-10 16:00:00 2.0
2010-10-10 17:00:00 1.0
2010-10-10 18:00:00 0.0
2010-10-10 19:00:00 0.0
2010-10-10 20:00:00 0.0
2010-10-10 21:00:00 0.0
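The window-relative positions above can be mapped back to the source frame with a little arithmetic: a window of size `w` that ends at row `i` starts at row `i - (w - 1)`, so the source position is `i - (w - 1) + argmax`. Below is a minimal sketch of that idea, assuming the example data from the question; the names `window`, `rel`, `abs_pos`, `stamps` and `result` are only illustrative, and the hourly frequency is passed as a `Timedelta` purely to avoid depending on a particular frequency alias.

```python
import numpy as np
import pandas as pd

# Example data from the question (hourly DatetimeIndex, 10 rows).
data = pd.DataFrame(
    index=pd.date_range(start="2010-10-10 12:00:00", periods=10,
                        freq=pd.Timedelta(hours=1)),
    data={"value": [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]},
)

window = 3  # illustrative name, not from the original post

# Position of the max inside each window (NaN for incomplete windows).
rel = data["value"].rolling(window).apply(lambda x: x.argmax(), raw=True)

# A window ending at row i starts at row i - (window - 1).
abs_pos = np.arange(len(data)) - (window - 1) + rel.to_numpy()
valid = ~np.isnan(abs_pos)

# Look up the source timestamps; incomplete windows stay NaT.
stamps = np.full(len(data), np.datetime64("NaT"), dtype="datetime64[ns]")
stamps[valid] = data.index.to_numpy()[abs_pos[valid].astype(int)]

result = pd.DataFrame(
    {"max": data["value"].rolling(window).max().to_numpy(), "idx": stamps},
    index=data.index,
)
print(result)
```

The same `abs_pos` values can be used directly with `iloc` if integer positions are preferred over timestamps.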
Can anyone help me find a suitable function or parameters in pandas for this?
Of course, I can use a for loop, like this:
pd.DataFrame([{'value_max': data[ind: ind + window][target_var].max(),
               'source_index': data[ind: ind + window].index[data[ind: ind + window][target_var].values.argmax()]
               } for ind in range(1, len(data) + 1 - window)],
             index=data.index[1:-window+1])
And it works. But I would like to find a more elegant solution using pandas.
Expected output:
source_index value_max
2010-10-10 13:00:00 2010-10-10 13:00:00 2
2010-10-10 14:00:00 2010-10-10 16:00:00 5
2010-10-10 15:00:00 2010-10-10 16:00:00 5
2010-10-10 16:00:00 2010-10-10 16:00:00 5
2010-10-10 17:00:00 2010-10-10 17:00:00 1
2010-10-10 18:00:00 2010-10-10 18:00:00 1
2010-10-10 19:00:00 2010-10-10 19:00:00 1
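For reference, the loop in the question leaves `window` and `target_var` undefined. Here is a runnable sketch of it, assuming `window = 3` and `target_var = 'value'` (both are guesses that happen to reproduce the expected output shown above). Note that each row is labelled by the first timestamp of its window and that the loop starts at position 1, as in the original snippet.

```python
import pandas as pd

# Example data from the question (hourly DatetimeIndex, 10 rows).
data = pd.DataFrame(
    index=pd.date_range(start="2010-10-10 12:00:00", periods=10,
                        freq=pd.Timedelta(hours=1)),
    data={"value": [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]},
)

window = 3            # assumed; not given in the question
target_var = "value"  # assumed; not given in the question

rows = []
for ind in range(1, len(data) + 1 - window):
    # Forward-looking window covering positions ind .. ind + window - 1.
    chunk = data[target_var].iloc[ind: ind + window]
    rows.append({"source_index": chunk.idxmax(),  # label of the first max
                 "value_max": chunk.max()})

# Label each row with the timestamp at which its window starts.
expected = pd.DataFrame(rows, index=data.index[1: len(data) - window + 1])
print(expected)
```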
Use Rolling.agg with a custom function, because idxmax is not implemented for rolling windows yet:
import numpy as np

def idx(x):
    # x is the window as a Series; return the index label of its first maximum
    return x.index.values[np.argmax(x.values)]

df = data['value'].rolling(3).agg(['max', idx])
df['idx'] = pd.to_datetime(df['idx'])
print(df)
max idx
2010-10-10 12:00:00 NaN NaT
2010-10-10 13:00:00 NaN NaT
2010-10-10 14:00:00 3.0 2010-10-10 12:00:00
2010-10-10 15:00:00 2.0 2010-10-10 13:00:00
2010-10-10 16:00:00 5.0 2010-10-10 16:00:00
2010-10-10 17:00:00 5.0 2010-10-10 16:00:00
2010-10-10 18:00:00 5.0 2010-10-10 16:00:00
2010-10-10 19:00:00 1.0 2010-10-10 17:00:00
2010-10-10 20:00:00 1.0 2010-10-10 18:00:00
2010-10-10 21:00:00 1.0 2010-10-10 19:00:00
Thanks to @Sandeep Kadapa for improving the solution:
def idx(x):
    return x.idxmax().to_datetime64()
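If the rolling callback approach is inconvenient (it has to hand a datetime back through a numeric rolling result), a vectorized alternative is to build the windows with NumPy and do the index lookup outside pandas' rolling machinery. This is a different technique from the answer above, sketched here under the assumption that NumPy 1.20 or newer is available (for `sliding_window_view`); the names `window`, `windows`, `src_pos` and `result` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Example data from the question (hourly DatetimeIndex, 10 rows).
data = pd.DataFrame(
    index=pd.date_range(start="2010-10-10 12:00:00", periods=10,
                        freq=pd.Timedelta(hours=1)),
    data={"value": [3, 2, 1, 0, 5, 1, 1, 1, 1, 1]},
)

window = 3
values = data["value"].to_numpy()

# One row per complete window: shape (len(data) - window + 1, window).
windows = sliding_window_view(values, window)

# Window k starts at source position k, so add the in-window argmax.
src_pos = np.arange(len(windows)) + windows.argmax(axis=1)

result = pd.DataFrame(
    {"max": windows.max(axis=1),
     "idx": data.index.to_numpy()[src_pos]},
    index=data.index[window - 1:],  # label by window end, like rolling()
)
print(result)
```

Unlike `rolling(3)`, this drops the incomplete leading windows instead of filling them with NaN/NaT, and `max` keeps its integer dtype. Labelling with `data.index[:len(windows)]` instead would label each window by its start.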