pandas: combine stock data only if it falls between specific times in a DataFrame
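I have per-minute stock data from 2017 to 2019.
I want to keep only the data from 9:16 onward for each day,
so I want to merge any data between 9:00 and 9:16 into the 9:16 row,
i.e. the values at 09:16 should be:
open: the value of the first row between 9:00 and 9:16, here 116.00
high: the highest value between 9:00 and 9:16, here 117.80
low: the lowest value between 9:00 and 9:16, here 116.00
close: the value at 9:16, here 113.00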
open high low close
date
2017-01-02 09:08:00 116.00 116.00 116.00 116.00
2017-01-02 09:16:00 116.10 117.80 117.00 113.00
2017-01-02 09:17:00 115.50 116.20 115.50 116.20
2017-01-02 09:18:00 116.05 116.35 116.00 116.00
2017-01-02 09:19:00 116.00 116.00 115.60 115.75
... ... ... ... ...
2029-12-29 15:56:00 259.35 259.35 259.35 259.35
2019-12-29 15:57:00 260.00 260.00 260.00 260.00
2019-12-29 15:58:00 260.00 260.00 259.35 259.35
2019-12-29 15:59:00 260.00 260.00 260.00 260.00
2019-12-29 16:36:00 259.35 259.35 259.35 259.35
This is what I tried:
# Get data from/to 9:00 - 9:16 and create only one data item
convertPreTrade = df.between_time("09:00", "09:16")  # 09:00 - 09:16
# combine modified value to original data
df.loc[df.index.strftime("%H:%M") == "09:16",
       ["open", "high", "low", "close"]] = [convertPreTrade["open"][0],
                                            convertPreTrade["high"].max(),
                                            convertPreTrade["low"].min(),
                                            convertPreTrade['close'][-1]]
But this does not give me accurate data.
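Extract the rows from 9:00 to 9:16, group them by year, month, and day, and compute the OHLC values for each group. The aggregation logic follows your code. Finally, add a date column stamped at 9:16. Since we don't have all of the data, some edge cases may be missing, but the basic form stays the same.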
import pandas as pd
import numpy as np
import io
data = '''
date open high low close
"2017-01-02 09:08:00" 116.00 116.00 116.00 116.00
"2017-01-02 09:16:00" 116.10 117.80 117.00 113.00
"2017-01-02 09:17:00" 115.50 116.20 115.50 116.20
"2017-01-02 09:18:00" 116.05 116.35 116.00 116.00
"2017-01-02 09:19:00" 116.00 116.00 115.60 115.75
"2017-01-03 09:08:00" 259.35 259.35 259.35 259.35
"2017-01-03 09:09:00" 260.00 260.00 260.00 260.00
"2017-12-03 09:18:00" 260.00 260.00 259.35 259.35
"2017-12-04 09:05:00" 260.00 260.00 260.00 260.00
"2017-12-04 09:22:00" 259.35 259.35 259.35 259.35
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['date'] = pd.to_datetime(df['date'])
# Keep only the rows from 9:00 to 9:16 (inclusive)
df_start = df[(df['date'].dt.hour == 9) & (df['date'].dt.minute <= 16)]
# Aggregate the OHLC values per calendar day
df_new = (df_start.groupby([df_start['date'].dt.year, df_start['date'].dt.month, df_start['date'].dt.day])
          .agg(open_first=('open', lambda x: x.iloc[0]),
               high_max=('high', 'max'),
               low_min=('low', 'min'),
               close_shift=('close', lambda x: x.iloc[-1])))
df_new.index.names = ['year', 'month', 'day']
df_new.reset_index(inplace=True)
# Label each aggregated row with that day's 9:16 timestamp (as a string)
df_new['date'] = df_new['year'].astype(str)+'-'+df_new['month'].astype(str)+'-'+df_new['day'].astype(str)+' 09:16:00'
year month day open_first high_max low_min close_shift date
0 2017 1 2 116.00 117.8 116.00 113.0 2017-1-2 09:16:00
1 2017 1 3 259.35 260.0 259.35 260.0 2017-1-3 09:16:00
2 2017 12 4 260.00 260.0 260.00 260.0 2017-12-4 09:16:00
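The result above contains only the aggregated rows. As a minimal sketch of the remaining step, assuming the goal is to put those rows back in place of the original 9:00-9:16 rows, something like the following could work. The names in_window, bars and result are illustrative and not part of the answer above, and the small frame is made-up sample data in the same shape as the question's.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime([
            "2017-01-02 09:08:00", "2017-01-02 09:16:00", "2017-01-02 09:17:00",
            "2017-01-03 09:08:00", "2017-01-03 09:09:00", "2017-01-03 09:20:00",
        ]),
        "open":  [116.00, 116.10, 115.50, 259.35, 260.00, 261.00],
        "high":  [116.00, 117.80, 116.20, 259.35, 260.00, 261.00],
        "low":   [116.00, 117.00, 115.50, 259.35, 260.00, 261.00],
        "close": [116.00, 113.00, 116.20, 259.35, 260.00, 261.00],
    }
)

# Boolean mask for rows whose time of day is within 09:00-09:16 (inclusive)
minutes = df["date"].dt.hour * 60 + df["date"].dt.minute
in_window = minutes.between(9 * 60, 9 * 60 + 16)

# One aggregated bar per calendar day
bars = (
    df[in_window]
    .groupby(df.loc[in_window, "date"].dt.normalize())
    .agg(open=("open", "first"), high=("high", "max"),
         low=("low", "min"), close=("close", "last"))
)
# Stamp each aggregated bar at 09:16 of its day
bars.index = bars.index + pd.Timedelta(hours=9, minutes=16)
bars.index.name = "date"

# Swap the window rows for the aggregated bars and restore time order
result = pd.concat([df[~in_window].set_index("date"), bars]).sort_index()
print(result)
```

Note that in this sketch a day with data only before 9:16 (such as 2017-01-03 here) still gets a 9:16 row; whether that is desirable depends on the data.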
Using @r-beginners' data, with a few extra rows added:
import pandas as pd
import numpy as np
import io
data = '''
datetime open high low close
"2017-01-02 09:08:00" 116.00 116.00 116.00 116.00
"2017-01-02 09:16:00" 116.10 117.80 117.00 113.00
"2017-01-02 09:17:00" 115.50 116.20 115.50 116.20
"2017-01-02 09:18:00" 116.05 116.35 116.00 116.00
"2017-01-02 09:19:00" 116.00 116.00 115.60 115.75
"2017-01-03 09:08:00" 259.35 259.35 259.35 259.35
"2017-01-03 09:09:00" 260.00 260.00 260.00 260.00
"2017-01-03 09:16:00" 260.00 260.00 260.00 260.00
"2017-01-03 09:17:00" 261.00 261.00 261.00 261.00
"2017-01-03 09:18:00" 262.00 262.00 262.00 262.00
"2017-12-03 09:18:00" 260.00 260.00 259.35 259.35
"2017-12-04 09:05:00" 260.00 260.00 260.00 260.00
"2017-12-04 09:22:00" 259.35 259.35 259.35 259.35
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
The code below does the whole process. It may not be the best way, but it is quick and dirty:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df['date'] = df.index.date
dates = np.unique(df.index.date)
# First bar at or after 9:16 for each day
first_rows = df.between_time('9:16', '00:00').reset_index().groupby('date').first().set_index('datetime')
first_rows['date'] = first_rows.index.date
dffs = []
for d in dates:
    df_day = df[df['date'] == d].sort_index()
    first_bar_of_the_day = first_rows[first_rows['date'] == d].copy()
    # Skip days that have no bar at or after 9:16
    if first_bar_of_the_day.empty:
        continue
    first_ts = first_bar_of_the_day.index[0]
    # Fold every bar up to and including the first bar into that bar
    bars_until_first = df_day.loc[df_day.index <= first_ts]
    first_bar_of_the_day['open'] = bars_until_first['open'].values[0]
    first_bar_of_the_day['high'] = bars_until_first['high'].max()
    first_bar_of_the_day['low'] = bars_until_first['low'].min()
    first_bar_of_the_day['close'] = bars_until_first['close'].values[-1]
    bars_after_first = df_day.loc[df_day.index > first_ts]
    # Keep the later bars, even when there is only one of them
    if len(bars_after_first) > 0:
        dff = pd.concat([first_bar_of_the_day, bars_after_first])
    else:
        dff = first_bar_of_the_day.copy()
    print(dff)
    dffs.append(dff)
combined_df = pd.concat(dffs)
print(combined_df)
This prints the following.
dff for each date:
open high low close date
datetime
2017-01-02 09:16:00 116.00 117.80 116.0 113.00 2017-01-02
2017-01-02 09:17:00 115.50 116.20 115.5 116.20 2017-01-02
2017-01-02 09:18:00 116.05 116.35 116.0 116.00 2017-01-02
2017-01-02 09:19:00 116.00 116.00 115.6 115.75 2017-01-02
open high low close date
datetime
2017-01-03 09:16:00 259.35 260.0 259.35 260.0 2017-01-03
2017-01-03 09:17:00 261.00 261.0 261.00 261.0 2017-01-03
2017-01-03 09:18:00 262.00 262.0 262.00 262.0 2017-01-03
open high low close date
datetime
2017-12-03 09:18:00 260.0 260.0 259.35 259.35 2017-12-03
open high low close date
datetime
2017-12-04 09:22:00 260.0 260.0 259.35 259.35 2017-12-04
combined_df
open high low close date
datetime
2017-01-02 09:16:00 116.00 117.80 116.00 113.00 2017-01-02
2017-01-02 09:17:00 115.50 116.20 115.50 116.20 2017-01-02
2017-01-02 09:18:00 116.05 116.35 116.00 116.00 2017-01-02
2017-01-02 09:19:00 116.00 116.00 115.60 115.75 2017-01-02
2017-01-03 09:16:00 259.35 260.00 259.35 260.00 2017-01-03
2017-01-03 09:17:00 261.00 261.00 261.00 261.00 2017-01-03
2017-01-03 09:18:00 262.00 262.00 262.00 262.00 2017-01-03
2017-12-03 09:18:00 260.00 260.00 259.35 259.35 2017-12-03
2017-12-04 09:22:00 260.00 260.00 259.35 259.35 2017-12-04
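Side note: I'm not sure that your way of cleaning the data is the best one. You could consider ignoring everything before 9:16 am each day entirely, or even analyse the volatility of the first 15 minutes before deciding.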
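As a rough sketch of the two alternatives in the side note: the first option simply drops every bar before 9:16, and the second measures how far the price moved before 9:16 on each day. The high-low range used here is only an assumed, very crude stand-in for volatility, and the names regular, early and early_range are illustrative, as is the sample data.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-01-02 09:08:00", "2017-01-02 09:16:00", "2017-01-02 09:17:00",
    "2017-01-03 09:08:00", "2017-01-03 09:09:00", "2017-01-03 09:16:00",
])
df = pd.DataFrame(
    {
        "open":  [116.00, 116.10, 115.50, 259.35, 260.00, 260.00],
        "high":  [116.00, 117.80, 116.20, 259.35, 260.00, 260.00],
        "low":   [116.00, 117.00, 115.50, 259.35, 260.00, 260.00],
        "close": [116.00, 113.00, 116.20, 259.35, 260.00, 260.00],
    },
    index=idx,
)

# Option 1: discard every bar before 09:16 without merging anything
regular = df.between_time("09:16", "23:59:59")

# Option 2: per-day price range of the bars before 09:16,
# as a crude indication of how much happened in that period
early = df.between_time("09:00", "09:15")
early_range = early.groupby(early.index.date).agg(high=("high", "max"), low=("low", "min"))
early_range["range"] = early_range["high"] - early_range["low"]

print(regular)
print(early_range)
```

If the early range is usually negligible, dropping the early bars may be simpler than merging them.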
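# Aggregation applied to the rows of each day's 09:00-09:16 window
d = {'date': 'last', 'open': 'last',
     'high': 'max', 'low': 'min', 'close': 'last'}
# df.index = pd.to_datetime(df.index)
# Rows between 09:00 and 09:16 (both ends included)
s1 = df.between_time('09:00:00', '09:16:00')
# One aggregated row per calendar day
s2 = s1.reset_index().groupby(s1.index.date).agg(d).set_index('date')
# Replace the window rows with the aggregated rows
df1 = pd.concat([df.drop(s1.index), s2]).sort_index()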
Details:
Use DataFrame.between_time to filter the rows of the dataframe df that fall between 09:00 and 09:16:
print(s1)
open high low close
date
2017-01-02 09:08:00 116.0 116.0 116.0 116.0
2017-01-02 09:16:00 116.1 117.8 117.0 113.0
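As a small illustration of this filtering step: DataFrame.between_time needs a DatetimeIndex, and by default it includes both end points, so the 09:16 row itself is part of the window. The frame demo below is made-up sample data.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-01-02 09:08:00",
    "2017-01-02 09:16:00",
    "2017-01-02 09:17:00",
])
demo = pd.DataFrame({"close": [116.00, 113.00, 116.20]}, index=idx)

# Both 09:00:00 and 09:16:00 are inclusive bounds by default
window = demo.between_time("09:00:00", "09:16:00")
print(window)
```

The selection looks only at the time of day, so on a multi-day frame it returns the matching rows from every day at once. That is why the next step groups by date.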
Use DataFrame.groupby to group the filtered dataframe s1 on date and aggregate it using the dictionary d:
print(s2)
open high low close
date
2017-01-02 09:16:00 116.1 117.8 116.0 113.0
Use DataFrame.drop to drop the rows from the original dataframe df that fall between 09:00 and 09:16, then use pd.concat to concatenate the result with s2, and finally use DataFrame.sort_index to sort the index:
print(df1)
open high low close
date
2017-01-02 09:16:00 116.10 117.80 116.00 113.00
2017-01-02 09:17:00 115.50 116.20 115.50 116.20
2017-01-02 09:18:00 116.05 116.35 116.00 116.00
2017-01-02 09:19:00 116.00 116.00 115.60 115.75
2019-12-29 15:57:00 260.00 260.00 260.00 260.00
2019-12-29 15:58:00 260.00 260.00 259.35 259.35
2019-12-29 15:59:00 260.00 260.00 260.00 260.00
2019-12-29 16:36:00 259.35 259.35 259.35 259.35
2029-12-29 15:56:00 259.35 259.35 259.35 259.35
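Note that the dictionary d takes open from the last row of the window, which is why the 09:16 row above shows an open of 116.10. The question asks for the first value in the window instead. A minimal variant under that reading, with the only change being 'open': 'first', might look like this. The name agg and the sample data are illustrative.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-01-02 09:08:00", "2017-01-02 09:16:00", "2017-01-02 09:17:00",
    "2017-01-03 09:08:00", "2017-01-03 09:16:00", "2017-01-03 09:17:00",
])
df = pd.DataFrame(
    {
        "open":  [116.00, 116.10, 115.50, 259.35, 260.00, 261.00],
        "high":  [116.00, 117.80, 116.20, 259.35, 260.00, 261.00],
        "low":   [116.00, 117.00, 115.50, 259.35, 260.00, 261.00],
        "close": [116.00, 113.00, 116.20, 259.35, 260.00, 261.00],
    },
    index=idx,
)
df.index.name = "date"

# Same aggregation as above, except that open comes from the first row
agg = {"date": "last", "open": "first",
       "high": "max", "low": "min", "close": "last"}

s1 = df.between_time("09:00:00", "09:16:00")
s2 = s1.reset_index().groupby(s1.index.date).agg(agg).set_index("date")
df1 = pd.concat([df.drop(s1.index), s2]).sort_index()
print(df1)
```

Because 'date': 'last' keeps the last timestamp found in the window, the merged row is labelled 09:16 only on days where a 09:16 bar actually exists.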