How do I get minutes/hours financial data with row summaries using Python/pandas?
Suppose I have minute-level financial data like the sample below, and I want to write a user-defined function (my code below is ugly and complicated). How can I get 5-minute/10-minute/30-minute/1-hour/8-hour/24-hour data with row summaries from a CSV using Python/pandas?
TIME OPEN HIGH LOW CLOSE VOLUME
----------------------------------------------
0 1592194620 3046.00 3048.50 3046.00 3047.50 505
1 1592194630 3047.00 3048.00 3046.00 3047.00 162
2 1592194640 3047.50 3048.00 3047.00 3047.50 98
3 1592194650 3047.50 3047.50 3047.00 3047.50 228
4 1592194660 3048.00 3048.00 3047.50 3048.00 136
5 1592194670 3048.00 3048.00 3046.50 3046.50 174
6 1592194680 3046.50 3046.50 3045.00 3045.00 134
7 1592194690 3045.50 3046.00 3044.00 3045.00 43
8 1592194700 3045.00 3045.50 3045.00 3045.00 214
9 1592194710 3045.50 3045.50 3045.50 3045.50 8
10 1592194720 3045.50 3046.00 3044.50 3044.50 152
.......
.......
19999 1591594660 3048.00 3048.00 3047.50 3048.00 136
The sample output looks like this:
3048.50 2140 2020-06-13 04:34:00
3050.50 67 2020-06-13 04:35:00
3049.50 1489 2020-06-13 04:36:00
3047.50 987 2020-06-13 04:37:00
......
3099.50 2 2020-06-14 04:34:00
Here is my clumsy code:
import pandas as pd
import pymysql

conn = pymysql.connect(host="localhost",
                       user="root",
                       passwd="root",
                       db="demo")
sql = "SELECT TIME, OPEN, HIGH, LOW, CLOSE, VOLUME FROM demo_table;"
df = pd.read_sql(sql, conn)

# 12 hours for 1000 records
for i in range(1000, 20000 - 1000, 1):
    high_price = df.loc[i, ['high']][0]
    df_1000 = df.loc[i - 1000:i]
    df_high = df_1000[df_1000['high'] > high_price]
    high_count = df_high.shape[0]
    df_last = df_high.tail(1)
    time_dt = pd.Timestamp(df_last['TIME'], unit='s')
    print(high_price, high_count, time_dt)
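First, I suggest reading the CSV and setting TIME as the index:
import pandas as pd
import numpy as np

# csv_file is the path to your whitespace-separated data file
df = pd.read_csv(csv_file, sep=r'\s+')
# TIME holds Unix timestamps in seconds
df['TIME'] = pd.to_datetime(df['TIME'], unit='s')
df.set_index('TIME', inplace=True)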
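As a small check of the loading step, here is a minimal sketch that parses the first three rows of the question's table from an in-memory string. It assumes the file contains only the header and the six data columns, without the leading row-number column or the dashed separator line shown in the question. The `sort_index()` call is an extra precaution that is not in the snippet above.

```python
import io

import pandas as pd

# First rows of the question's table, in the assumed clean layout:
# header line plus whitespace-separated values, nothing else.
sample = """TIME OPEN HIGH LOW CLOSE VOLUME
1592194620 3046.00 3048.50 3046.00 3047.50 505
1592194630 3047.00 3048.00 3046.00 3047.00 162
1592194640 3047.50 3048.00 3047.00 3047.50 98
"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+")
# Interpret TIME as Unix timestamps in seconds.
df["TIME"] = pd.to_datetime(df["TIME"], unit="s")
# Sorting by time keeps later "first row" / "last row" logic meaningful.
df = df.set_index("TIME").sort_index()

print(df)
```

With a real file, the `io.StringIO(sample)` argument would be replaced by the file path.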
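If you just want to convert the data from one time interval to another (for example, from the current 1 minute to 5 minutes), you can resample it easily with the DataFrame.resample method: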
# Tells what the aggregation should do for each column.
# 'first' and 'last' also cope with empty intervals (they yield NaN).
colls_agg = {'OPEN': 'first',
             'HIGH': 'max',
             'LOW': 'min',
             'CLOSE': 'last',
             'VOLUME': 'sum'}

def get_summary(df, time_interval):
    """Resamples df into bars of length time_interval (for example '5min')."""
    return df.resample(pd.Timedelta(time_interval)).agg(colls_agg)
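As a usage sketch, the resampling idea can be tried on the eleven sample rows from the question. The frame is built inline rather than read from a file, the name `ohlcv_agg` is illustrative, and the `'1min'` and `'5min'` targets are only examples (the sample rows are 10 seconds apart).

```python
import pandas as pd

# The eleven sample rows from the question.
rows = [
    (1592194620, 3046.0, 3048.5, 3046.0, 3047.5, 505),
    (1592194630, 3047.0, 3048.0, 3046.0, 3047.0, 162),
    (1592194640, 3047.5, 3048.0, 3047.0, 3047.5, 98),
    (1592194650, 3047.5, 3047.5, 3047.0, 3047.5, 228),
    (1592194660, 3048.0, 3048.0, 3047.5, 3048.0, 136),
    (1592194670, 3048.0, 3048.0, 3046.5, 3046.5, 174),
    (1592194680, 3046.5, 3046.5, 3045.0, 3045.0, 134),
    (1592194690, 3045.5, 3046.0, 3044.0, 3045.0, 43),
    (1592194700, 3045.0, 3045.5, 3045.0, 3045.0, 214),
    (1592194710, 3045.5, 3045.5, 3045.5, 3045.5, 8),
    (1592194720, 3045.5, 3046.0, 3044.5, 3044.5, 152),
]
df = pd.DataFrame(rows, columns=["TIME", "OPEN", "HIGH", "LOW", "CLOSE", "VOLUME"])
df["TIME"] = pd.to_datetime(df["TIME"], unit="s")
df = df.set_index("TIME").sort_index()

# Per-column aggregation for an OHLCV bar.
ohlcv_agg = {"OPEN": "first", "HIGH": "max", "LOW": "min",
             "CLOSE": "last", "VOLUME": "sum"}

# One resampled frame per target interval.
bars = {interval: df.resample(interval).agg(ohlcv_agg)
        for interval in ("1min", "5min")}

print(bars["1min"])
print(bars["5min"])
```

Each resampled row is labelled with the start of its interval, so the 1-minute frame has one row for 04:17:00 and one for 04:18:00.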
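If you want each row of the dataframe to hold the summary of the last X minutes (which I believe is what you want), you need to recompute it for each row, as follows.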
colls_agg = {'OPEN': lambda x: x.iloc[0],
             'HIGH': 'max',
             'LOW': 'min',
             'CLOSE': lambda x: x.iloc[-1],
             'VOLUME': 'sum'}

def recompute_summary_line(line, full_df, time_interval):
    """Recomputes the summary for a line of the dataframe.
    line should be a line of the dataframe,
    full_df is the full dataframe,
    time_interval is the interval of time which will be selected"""
    # Selects the rows from (current time - time_interval), exclusive,
    # until the current time, inclusive
    lines_to_select = (full_df.index > line.name - time_interval) & \
                      (full_df.index <= line.name)
    # An empty selection cannot be aggregated, so return NaN instead.
    # Since we have included the current time, this never happens here.
    # If you do NOT want to include the current time, you need this check.
    if not lines_to_select.any():
        return pd.Series({'OPEN': np.nan, 'HIGH': np.nan, 'LOW': np.nan,
                          'CLOSE': np.nan, 'VOLUME': np.nan})
    return full_df[lines_to_select].agg(colls_agg)

def recompute_summary(df, time_interval):
    """Given a dataframe df, recomputes the summary for the
    current time of each row using the information from the previous
    interval given in time_interval (for example '5min', '30s')"""
    # Use df.apply to apply it to each line of the dataframe
    return df.apply(lambda x: recompute_summary_line(
        x, df, pd.Timedelta(time_interval)), axis='columns')

recompute_summary(df, '1min')
recompute_summary(df, '12h')
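The row-by-row recomputation scans the full dataframe once per row, which may become slow on the 20000 rows mentioned in the question. A possible alternative, which is not part of the approach above, is pandas' time-based rolling windows. The sketch below assumes a sorted DatetimeIndex and uses a hypothetical helper name `rolling_summary`. With the default settings, the window for a row at time t covers (t - interval, t], which is the same set of rows the boolean mask above selects.

```python
import pandas as pd

# First six sample rows from the question (10-second spacing).
rows = [
    (1592194620, 3046.0, 3048.5, 3046.0, 3047.5, 505),
    (1592194630, 3047.0, 3048.0, 3046.0, 3047.0, 162),
    (1592194640, 3047.5, 3048.0, 3047.0, 3047.5, 98),
    (1592194650, 3047.5, 3047.5, 3047.0, 3047.5, 228),
    (1592194660, 3048.0, 3048.0, 3047.5, 3048.0, 136),
    (1592194670, 3048.0, 3048.0, 3046.5, 3046.5, 174),
]
df = pd.DataFrame(rows, columns=["TIME", "OPEN", "HIGH", "LOW", "CLOSE", "VOLUME"])
df["TIME"] = pd.to_datetime(df["TIME"], unit="s")
# Time-based rolling windows need a monotonic DatetimeIndex.
df = df.set_index("TIME").sort_index()


def rolling_summary(df, time_interval):
    """Trailing-window OHLCV summary for every row (hypothetical helper)."""
    window = df.rolling(time_interval)
    return pd.DataFrame({
        # First OPEN inside the window; raw=True passes a NumPy array.
        "OPEN": window["OPEN"].apply(lambda x: x[0], raw=True),
        "HIGH": window["HIGH"].max(),
        "LOW": window["LOW"].min(),
        # The window ends at the current row, so its CLOSE is the row's own.
        "CLOSE": df["CLOSE"],
        "VOLUME": window["VOLUME"].sum(),
    })


result = rolling_summary(df, "30s")
print(result)
```

Rolling aggregations return floats, so VOLUME comes back as a float column; cast it with `astype(int)` if integer volumes are needed.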