In Pandas, how to return the id for the next value which is above/below a threshold
I have a DataFrame like this:
date value
0 2018-05-15 06:00:00 100.86
1 2018-05-15 07:00:00 101.99
2 2018-05-15 08:00:00 110.00
3 2018-05-15 09:00:00 95.49
4 2018-05-15 10:00:00 92.32
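I want to create new columns that tell me the index of the next value that is above or below the current value by a certain amount. For example, more than 5 above and more than 3 below: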
date value Over_5 Under_3
0 2018-05-15 06:00:00 100.86 2 3
1 2018-05-15 07:00:00 101.99 2 3
2 2018-05-15 08:00:00 110.00 NaN 3
3 2018-05-15 09:00:00 95.49 NaN 4
4 2018-05-15 10:00:00 92.32 NaN NaN
Ideally, I would then like to return a boolean for which comes first: over 5 (0) or under 3 (1)?
date value Over_5 Under_3 Bool
0 2018-05-15 06:00:00 100.86 2 3 1
1 2018-05-15 07:00:00 101.99 2 3 1
2 2018-05-15 08:00:00 110.00 NaN 3 NaN
3 2018-05-15 09:00:00 95.49 NaN 4 NaN
4 2018-05-15 10:00:00 92.32 NaN NaN NaN
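Sorry, I know this isn't the best example. My current idea is to use a groupby with idxmax, but I'm not sure how to do it other than with a while/for loop.
It's for label encoding in an ML project, so a vectorised way of doing this would be great.
Thanks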
How about:
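import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-05-15 06:00:00', '2018-05-15 07:00:00', '2018-05-15 08:00:00',
                                           '2018-05-15 09:00:00', '2018-05-15 10:00:00']),
                   'value': [100.86, 101.99, 110.00, 95.49, 92.32]})
# Index of the first row (from the current one onwards) that is more than 5 above the current value.
df['Over_5'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
                        .where(lambda v: v > 5).first_valid_index(), axis=1)
# Index of the first row (from the current one onwards) that is more than 3 below the current value.
df['Under_3'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
                         .where(lambda v: v < -3).first_valid_index(), axis=1)
# 1.0 where Over_5 comes before Under_3, NaN otherwise (comparisons against NaN are False).
df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)
print(df)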
# Prints:
date value Over_5 Under_3 Bool
0 2018-05-15 06:00:00 100.86 2.0 3.0 1.0
1 2018-05-15 07:00:00 101.99 2.0 3.0 1.0
2 2018-05-15 08:00:00 110.00 NaN 3.0 NaN
3 2018-05-15 09:00:00 95.49 NaN 4.0 NaN
4 2018-05-15 10:00:00 92.32 NaN NaN NaN
Some details:
# Gets the values from the current row onwards (the current row itself is included).
df[df.index>=row.name].value
# Function that subtracts the value of the current row.
lambda row: (df[df.index>=row.name].value - row.value)
# Function that gets the first index where the difference is > 5 (None if there is no such row).
f = lambda row: (df[df.index>=row.name].value - row.value).where(lambda v: v > 5).first_valid_index()
# Apply the function of the previous line to every row.
df.apply(f, axis=1)
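To see what those steps produce for a single row, here is a minimal sketch that runs them by hand for row 0 of the example data. The variable names (`row_label`, `diffs`, `hits`) are only illustrative and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'value': [100.86, 101.99, 110.00, 95.49, 92.32]})

row_label = 0                                   # row being inspected (illustrative name)
current = df.loc[row_label, 'value']

# Differences between the current value and every value from this row onwards.
diffs = df.loc[df.index >= row_label, 'value'] - current

# Keep only the differences above 5; everything else becomes NaN.
hits = diffs.where(diffs > 5)

# Label of the first surviving entry, or None if nothing qualifies.
first_over_5 = hits.first_valid_index()
print(first_over_5)
```

For the last row nothing qualifies, so `first_valid_index()` returns None, which ends up as NaN in the resulting column.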
Numpy alternative:
An alternative with numpy should be faster:
import numpy as np
arr = df['value'].to_numpy()
# Calculate all differences between values at once.
# Then, take the upper triangular matrix (i.e., only differences with values coming afterwards).
all_diffs = np.triu(arr - arr[:, None])
# Check for the first index where the condition is fulfilled.
df['Over_5'] = np.argmax(all_diffs > 5, axis=1)
df['Under_3'] = np.argmax(all_diffs < -3, axis=1)
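# argmax returns 0 when no entry matches. Position 0 can never be a genuine hit here
# (the diagonal is 0 and earlier rows are zeroed out by triu), so 0 means "not found".
df[['Over_5', 'Under_3']] = df[['Over_5', 'Under_3']].replace(0, np.nan)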
df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)
print(df)
date value Over_5 Under_3 Bool
0 2018-05-15 06:00:00 100.86 2.0 3.0 1.0
1 2018-05-15 07:00:00 101.99 2.0 3.0 1.0
2 2018-05-15 08:00:00 110.00 NaN 3.0 NaN
3 2018-05-15 09:00:00 95.49 NaN 4.0 NaN
4 2018-05-15 10:00:00 92.32 NaN NaN NaN
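One thing to be aware of in the numpy version: `np.argmax` returns 0 both when the first match is at position 0 and when there is no match at all, which is why the zeros are turned into NaN afterwards. A slightly more explicit variant, sketched below, masks the rows without any match instead. The helper name `first_match` is hypothetical, and the result is positional, so it lines up with the index labels only for a default `RangeIndex`.

```python
import numpy as np

def first_match(mask):
    """Position of the first True in each row of a 2-D boolean array, NaN if none."""
    pos = np.argmax(mask, axis=1).astype(float)
    pos[~mask.any(axis=1)] = np.nan   # rows that contain no True at all
    return pos

arr = np.array([100.86, 101.99, 110.00, 95.49, 92.32])

# all_diffs[i, j] = arr[j] - arr[i]; triu keeps only j >= i (current row and later rows).
all_diffs = np.triu(arr - arr[:, None])

over_5 = first_match(all_diffs > 5)
under_3 = first_match(all_diffs < -3)
print(over_5)
print(under_3)
```

If the DataFrame has a non-default index, the positions could be mapped back to labels afterwards, for example by indexing `df.index` with the non-NaN positions.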
Runtime difference:
# 2.92 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit calc_pandas()
# 587 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit calc_numpy()
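So, as you can see, the numpy version is about 5 times faster for your example (2.92 ms vs. 587 µs), and probably more for larger datasets. It needs more memory because it builds an n × n matrix of differences, but that should be fine unless you have a very large DataFrame.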
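If the full n × n matrix does become too large, one option is to process the rows in blocks so that only a slice of the matrix exists at any time. The following is a rough sketch under that assumption and is not part of the answer above; the function name `first_positions_chunked` and the `chunk_size` parameter are made up for illustration.

```python
import numpy as np

def first_positions_chunked(arr, over=5.0, under=3.0, chunk_size=1024):
    """For each element, position of the first later element that is more than
    `over` above it / more than `under` below it (NaN if there is none)."""
    n = len(arr)
    over_pos = np.full(n, np.nan)
    under_pos = np.full(n, np.nan)
    cols = np.arange(n)
    for start in range(0, n, chunk_size):
        rows = np.arange(start, min(start + chunk_size, n))
        # diffs[i, j] = arr[j] - arr[rows[i]]: a (chunk, n) slice of the full matrix.
        diffs = arr[None, :] - arr[rows, None]
        # Only strictly later positions count as candidates.
        later = cols[None, :] > rows[:, None]
        for mask, out in ((later & (diffs > over), over_pos),
                          (later & (diffs < -under), under_pos)):
            found = mask.any(axis=1)
            out[rows[found]] = mask.argmax(axis=1)[found]
    return over_pos, under_pos

arr = np.array([100.86, 101.99, 110.00, 95.49, 92.32])
over_pos, under_pos = first_positions_chunked(arr, chunk_size=2)
print(over_pos)
print(under_pos)
```

Peak memory is then roughly `chunk_size × n` values per temporary array instead of `n × n`, at the cost of a Python-level loop over the blocks.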