Conditional creation of a dataframe column based on numeric values
I have a pandas dataframe time series (about 1000 rows and the four columns below) that looks like this:
Date        Values  Avg   +1 Stdev
01/01/2010    1.01  1.00  1.05
02/01/2010    1.02  1.00  1.05
03/01/2010    1.04  1.00  1.05
04/01/2010   -0.97  1.00  1.05
05/01/2010    1.12  1.00  1.05
06/01/2010    1.08  1.00  1.05
....
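What I would like to do is create a fifth column (called 'Trigger Date') that returns the date (from the index column) if the value in column 2 exceeds the threshold set in column 4, and otherwise returns no value.
An additional constraint is that the fifth column should also not return a date if the previous value already exceeded the threshold in column 4.
In other words, the pseudocode for the problem is: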
If df['Values'] > df['+1 Stdev']
AND
If df['Values'] (for the row above) < df['+1 Stdev']
THEN
Return df['Date'] in new column df['Trigger Date']
ELSE
Leave row in df['Trigger Date'] blank
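Any help in solving this problem would be much appreciated.
Edit: A follow-up question: is there any way to add a third constraint, so that no trigger date is returned if a trigger date has already occurred within the past XX days (e.g. the past 30 days)? The expected output would look like: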
    Date        Values  Avg  +1 Stdev  Trigger Date
0   01/01/2010    1.01  1.0  1.05      NaN
1   02/01/2010    1.02  1.0  1.05      NaN
2   03/01/2010    1.04  1.0  1.05      NaN
3   04/01/2010   -0.97  1.0  1.05      NaN
4   05/01/2010    1.12  1.0  1.05      05/01/2010
5   06/01/2010    1.08  1.0  1.05      NaN
6   07/01/2010    1.03  1.0  1.05      NaN
7   08/01/2010    1.07  1.0  1.05      NaN           <- above threshold, but trigger occurred within last 30 days so don't return date
...
50  20/02/2010    1.12  1.0  1.05      20/02/2010    <- more than 30 days later, no trigger dates in between, so return date
Use numpy.where together with shift, which gives you the value from the row above:
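import numpy as np

# m1: the current value is above the threshold
m1 = df['Values'] > df['+1 Stdev']
# m2: the value in the previous row is below the threshold
m2 = df['Values'].shift() < df['+1 Stdev']
df['Trigger Date'] = np.where(m1 & m2, df['Date'], np.nan)
print(df)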
    Date        Values  Avg  +1 Stdev  Trigger Date
0   01/01/2010    1.01  1.0  1.05      NaN
1   02/01/2010    1.02  1.0  1.05      NaN
2   03/01/2010    1.04  1.0  1.05      NaN
3   04/01/2010   -0.97  1.0  1.05      NaN
4   05/01/2010    1.12  1.0  1.05      05/01/2010
5   06/01/2010    1.08  1.0  1.05      NaN
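To try this end to end, here is a minimal self-contained sketch that rebuilds the six sample rows from the question and applies the same two masks. The `trigger_dates` helper name is my own, and it uses `Series.where` rather than `np.where` so the result stays a pandas Series; treat it as one possible way to write it, not the only one.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'Date': ['01/01/2010', '02/01/2010', '03/01/2010',
             '04/01/2010', '05/01/2010', '06/01/2010'],
    'Values': [1.01, 1.02, 1.04, -0.97, 1.12, 1.08],
    'Avg': [1.00] * 6,
    '+1 Stdev': [1.05] * 6,
})

def trigger_dates(frame):
    """Hypothetical helper: date where Values crosses above '+1 Stdev'."""
    # Current row is above the threshold.
    above = frame['Values'] > frame['+1 Stdev']
    # Previous row is below the threshold (the first row compares
    # against NaN, which is False, so it can never trigger).
    prev_below = frame['Values'].shift() < frame['+1 Stdev']
    # Keep the date where both conditions hold, missing value elsewhere.
    return frame['Date'].where(above & prev_below)

df['Trigger Date'] = trigger_dates(df)
print(df)
```

Only the 05/01/2010 row satisfies both conditions here: 06/01/2010 is above the threshold too, but so was the row before it.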
Edit:
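import numpy as np
import pandas as pd

# Parse the day-first date strings so that date arithmetic works
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
m1 = df['Values'] > df['+1 Stdev']
m2 = df['Values'].shift() < df['+1 Stdev']
# Start of the 30-day look-back window for each row
a = df['Date'] - pd.Timedelta(days=30)
# For each window, flag the rows whose next date falls inside it
L = [df['Date'].shift(-1).isin(pd.date_range(x, y, freq='D')) for x, y in zip(a, df['Date'])]
# m3: True where the next row's date falls inside any of the windows
m3 = np.logical_or.reduce(L)
mask = (m1 & m2) | ~m3
df.loc[mask, 'Trigger Date'] = df['Date']
print(df)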
    Date        Values  Avg  +1 Stdev  Trigger Date
0   2010-01-01    1.01  1.0  1.05      NaT
1   2010-01-02    1.02  1.0  1.05      NaT
2   2010-01-03    1.04  1.0  1.05      NaT
3   2010-01-04   -0.97  1.0  1.05      NaT
4   2010-01-05    1.12  1.0  1.05      2010-01-05
5   2010-01-06    1.08  1.0  1.05      NaT
6   2010-02-20    1.12  1.0  1.05      2010-02-20
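One caveat on the 30-day version: as far as I can tell, `m3` is True for every row except the last one, because each next-row date always lies inside its own look-back window, so `~m3` only ever flags the final row. Whether a trigger is suppressed depends on which earlier triggers were actually kept, so a plain loop over the candidate rows may be the simplest approach. The sketch below assumes the dates are sorted ascending, the index is unique, and a trigger is kept only when more than 30 days have passed since the last kept trigger. The names `trigger_with_cooldown` and `cooldown_days` are my own, and the 19/02/2010 row is invented to stand in for the rows elided from the question.

```python
import pandas as pd

def trigger_with_cooldown(frame, cooldown_days=30):
    """Hypothetical helper: upward crossings, at most one per cooldown window."""
    dates = pd.to_datetime(frame['Date'], format='%d/%m/%Y')
    above = frame['Values'] > frame['+1 Stdev']
    prev_below = frame['Values'].shift() < frame['+1 Stdev']
    crossing = above & prev_below

    cooldown = pd.Timedelta(days=cooldown_days)
    out = pd.Series(pd.NaT, index=frame.index, dtype='datetime64[ns]')
    last = None  # date of the most recent trigger that was kept
    for idx in frame.index[crossing.to_numpy()]:
        date = dates.loc[idx]
        # Keep the trigger only if more than `cooldown_days` have passed
        # since the last kept trigger (or if there is none yet).
        if last is None or date - last > cooldown:
            out.loc[idx] = date
            last = date
    return out

# Rows taken from the question's expected output, plus one invented row
# (19/02/2010) so the final crossing has a below-threshold predecessor.
df = pd.DataFrame({
    'Date': ['04/01/2010', '05/01/2010', '07/01/2010',
             '08/01/2010', '19/02/2010', '20/02/2010'],
    'Values': [-0.97, 1.12, 1.03, 1.07, 1.00, 1.12],
    'Avg': [1.00] * 6,
    '+1 Stdev': [1.05] * 6,
})
df['Trigger Date'] = trigger_with_cooldown(df)
print(df)
```

With this data the 08/01/2010 crossing is dropped because it comes three days after the 05/01/2010 trigger, while 20/02/2010 is kept because it is 46 days later. The loop only visits the rows that are crossings, so it should stay cheap for a frame of about 1000 rows.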