In Pandas, how to return the id of the next value that is above/below a threshold

Question

I have a data frame like this:

    date                value
0   2018-05-15 06:00:00 100.86
1   2018-05-15 07:00:00 101.99
2   2018-05-15 08:00:00 110.00
3   2018-05-15 09:00:00 95.49
4   2018-05-15 10:00:00 92.32

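I want to create new columns that give, for each row, the index of the next value that is more than a certain amount above or below the current value, for example more than 5 above and more than 3 below: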

    date                value  Over_5 Under_3
0   2018-05-15 06:00:00 100.86  2     3
1   2018-05-15 07:00:00 101.99  2     3
2   2018-05-15 08:00:00 110.00  Nan   3
3   2018-05-15 09:00:00 95.49   Nan   4
4   2018-05-15 10:00:00 92.32   Nan   Nan

Ideally, I would also like to return a boolean indicating which comes first: over 5 (0) or under 3 (1)?

    date                value  Over_5 Under_3 Bool
0   2018-05-15 06:00:00 100.86  2     3       1
1   2018-05-15 07:00:00 101.99  2     3       1
2   2018-05-15 08:00:00 110.00  Nan   3       Nan
3   2018-05-15 09:00:00 95.49   Nan   4       Nan
4   2018-05-15 10:00:00 92.32   Nan   Nan     Nan

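Sorry, I know this isn't the best example. My current idea is to use idxmax with a groupby, but I'm not sure how to do that without resorting to a while/for loop.

This is for label encoding in an ML project, so a good vectorized way to do this would be great. Thanks.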

Answer 1

How about:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-05-15 06:00:00', '2018-05-15 07:00:00', '2018-05-15 08:00:00',
                            '2018-05-15 09:00:00', '2018-05-15 10:00:00']), 
                   'value': [100.86, 101.99, 110.00, 95.49, 92.32]})

df['Over_5'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
                        .where(lambda v: v > 5).first_valid_index(), axis=1)
df['Under_3'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
                         .where(lambda v: v < -3).first_valid_index(), axis=1)

df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)

print(df)

# Prints:
                 date   value  Over_5  Under_3  Bool
0 2018-05-15 06:00:00  100.86     2.0      3.0   1.0
1 2018-05-15 07:00:00  101.99     2.0      3.0   1.0
2 2018-05-15 08:00:00  110.00     NaN      3.0   NaN
3 2018-05-15 09:00:00   95.49     NaN      4.0   NaN
4 2018-05-15 10:00:00   92.32     NaN      NaN   NaN

Some details:

# Gets the values from the current row onwards (the current row is included, its difference is 0).
df[df.index>=row.name].value

# Function that subtracts the value of the current row.
lambda row: (df[df.index>=row.name].value - row.value)

# Function that gets the first index where the difference is > 5
f = lambda row: (df[df.index>=row.name].value - row.value).where(lambda v: v > 5).first_valid_index()

# Apply the function of the previous line to every row.
df.apply(f, axis=1)
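To make the per-row step concrete, here is the same chain written out for row 0 only. This is just an illustrative sketch on the sample values; the variable names `diffs`, `masked` and `first_over_5` are mine.

```python
import pandas as pd

df = pd.DataFrame({'value': [100.86, 101.99, 110.00, 95.49, 92.32]})

# Differences between row 0 and every row from row 0 onwards.
diffs = df.loc[0:, 'value'] - df.loc[0, 'value']

# Keep only the differences greater than 5; everything else becomes NaN.
masked = diffs.where(diffs > 5)

# Index label of the first non-NaN entry, or None if there is none.
first_over_5 = masked.first_valid_index()
print(first_over_5)  # 2

# For row 2 nothing later is more than 5 higher, so the result is None.
diffs_row2 = df.loc[2:, 'value'] - df.loc[2, 'value']
first_over_5_row2 = diffs_row2.where(diffs_row2 > 5).first_valid_index()
print(first_over_5_row2)  # None
```

A `None` result is what shows up as NaN in the final column.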

Numpy alternative:

An alternative with numpy that should be faster:

import numpy as np 

arr = df['value'].to_numpy()
# Calculate all differences between values at once.
# Then, take the upper triangular matrix (i.e., only differences with values coming afterwards).
all_diffs = np.triu(arr - arr[:, None])

# Check for the first index where the condition is fulfilled.
df['Over_5'] = np.argmax(all_diffs > 5, axis=1)
df['Under_3'] = np.argmax(all_diffs < -3, axis=1) 
df[['Over_5', 'Under_3']] = df[['Over_5', 'Under_3']].replace(0, np.nan)

df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)

print(df)

                 date   value  Over_5  Under_3  Bool
0 2018-05-15 06:00:00  100.86     2.0      3.0   1.0
1 2018-05-15 07:00:00  101.99     2.0      3.0   1.0
2 2018-05-15 08:00:00  110.00     NaN      3.0   NaN
3 2018-05-15 09:00:00   95.49     NaN      4.0   NaN
4 2018-05-15 10:00:00   92.32     NaN      NaN   NaN
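If the `replace(0, np.nan)` step looks odd, it may help to look at the intermediate arrays for the sample values. The sketch below only inspects them; it assumes the default 0..n-1 index, because `np.argmax` returns positions, not index labels.

```python
import numpy as np

arr = np.array([100.86, 101.99, 110.00, 95.49, 92.32])

# all_diffs[i, j] = arr[j] - arr[i]; np.triu zeroes the entries with j < i.
all_diffs = np.triu(arr - arr[:, None])
print(np.round(all_diffs, 2))

# argmax on a boolean row gives the position of the first True,
# and 0 when the row contains no True at all.
over_5 = np.argmax(all_diffs > 5, axis=1)
under_3 = np.argmax(all_diffs < -3, axis=1)
print(over_5)   # [2 2 0 0 0]
print(under_3)  # [3 3 3 4 0]
```

Position 0 can never be a real match here: the diagonal and the lower triangle are 0, and 0 is neither > 5 nor < -3. That is why a 0 can safely be read as "no match" and replaced with NaN. With a condition that 0 itself satisfies, this shortcut would no longer hold.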

Runtime difference:

# 2.92 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit calc_pandas()

# 587 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit calc_numpy()
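The timing cells call `calc_pandas()` and `calc_numpy()`, which are not defined above. Presumably they just wrap the two snippets; a minimal sketch under that assumption follows. The function bodies and the `first_hit` helper are my reconstruction, and the `date` and `Bool` columns are left out for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [100.86, 101.99, 110.00, 95.49, 92.32]})


def calc_pandas(frame=df):
    """Row-wise apply version (assumed wrapper around the first snippet)."""
    out = frame.copy()
    values = out['value']

    def first_hit(row, cond):
        # Differences from the current row onwards, then first index passing cond.
        diffs = values.loc[row.name:] - row['value']
        return diffs.where(cond).first_valid_index()

    out['Over_5'] = frame.apply(first_hit, axis=1, cond=lambda v: v > 5)
    out['Under_3'] = frame.apply(first_hit, axis=1, cond=lambda v: v < -3)
    return out


def calc_numpy(frame=df):
    """Difference-matrix version (assumed wrapper around the second snippet)."""
    out = frame.copy()
    arr = out['value'].to_numpy()
    all_diffs = np.triu(arr - arr[:, None])
    out['Over_5'] = np.argmax(all_diffs > 5, axis=1)
    out['Under_3'] = np.argmax(all_diffs < -3, axis=1)
    cols = ['Over_5', 'Under_3']
    out[cols] = out[cols].replace(0, np.nan)
    return out


res_pandas = calc_pandas()
res_numpy = calc_numpy()
print(res_numpy)
```

In IPython or Jupyter, `%timeit calc_pandas()` and `%timeit calc_numpy()` can then be run as shown; the absolute numbers will differ by machine and library version.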

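So, as you can see, the numpy version is roughly 5 times faster for your example (2.92 ms vs. 587 µs), and the gap will probably be larger for bigger data sets. It needs more memory because it builds the full n × n difference matrix, but that should be fine unless you have a very large data frame.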
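If the data frame does get large, one possible workaround is to build the difference matrix a block of rows at a time, so that only `chunk × n` values are held in memory at once. This is a different variant from the code above, not something I have benchmarked; the function name `first_index_chunked` and the `chunk` parameter are made up for this sketch, and it again assumes positional (0..n-1) indices.

```python
import numpy as np


def first_index_chunked(arr, threshold, above=True, chunk=1024):
    """Position of the next value more than `threshold` above (or below)
    each element, NaN where there is none. Holds chunk * n values at a time."""
    n = len(arr)
    out = np.full(n, np.nan)
    cols = np.arange(n)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # diffs[i, j] = arr[j] - arr[start + i]
        diffs = arr[None, :] - arr[start:stop, None]
        hit = diffs > threshold if above else diffs < -threshold
        # Only positions strictly after the current row count.
        hit &= cols[None, :] > np.arange(start, stop)[:, None]
        first = hit.argmax(axis=1)
        found = hit.any(axis=1)
        # out[start:stop] is a view, so this writes into `out`.
        out[start:stop][found] = first[found]
    return out


arr = np.array([100.86, 101.99, 110.00, 95.49, 92.32])
over_5 = first_index_chunked(arr, 5, above=True, chunk=2)
under_3 = first_index_chunked(arr, 3, above=False, chunk=2)
print(over_5)   # [ 2.  2. nan nan nan]
print(under_3)  # [ 3.  3.  3.  4. nan]
```

Because it checks `hit.any(axis=1)` explicitly, this variant does not rely on position 0 meaning "no match".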
