使用 pandas 高效计算控制点
Efficiently calculating point of control with pandas
在每天的时间范围内实施此功能时,我的算法运行时间从 35 秒增加到 15 分钟。该算法批量检索每日历史记录并迭代数据帧的一个子集(从 t0 到 tX,其中 tX 是迭代的当前行)。它这样做是为了模拟在算法的实时操作过程中会发生什么。我知道有一些方法可以通过在帧计算之间利用内存来改进它,但我想知道是否有更多 pandas-ish 实现可以立即受益。
假设 self.Step
类似于 0.00001
而 self.Precision
类似于 5
;为了找到 poc,它们用于将 ohlc bar 信息合并为离散的步骤。 _frame
是整个数据帧的行的子集,_low
/_high
是相应的。以下代码块在整个 _frame
上执行,每次算法添加新行时(计算每日数据的年度时间范围时),整个 _frame
可能超过 250 行。我相信是 iterrows 导致了主要的放缓。数据框包含 high
、low
、open
、close
、volume
等列。我正在计算时间价格机会和数量控制点。
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
# Evenly distribute the bar's volume over its range
volume_prices[_prices] += state.volume / _prices.size
# Increment time at price
time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2)
您可以将此函数用作基础并对其进行调整:
def f(x): #function to find the POC price and volume
a = x['tradePrice'].value_counts().index[0]
b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
return pd.Series([a,b],['POC_Price','POC_Volume'])
无论如何,我已经设法将它减少到 2 分钟而不是 15 分钟 - 在每天的时间范围内。它在较短的时间范围内仍然很慢(在 2 年期间每小时 10 分钟,股票精度为 2)。使用 DataFrame 与使用 Series 相比要慢得多。我希望得到更多,但我不知道除了以下解决方案之外我还能做什么:
# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
# Include 1 step above high as np.arange does not
# include the upper limit by default
state = frame.iloc[-min(bars + 1, frame.index.size)]
bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
return pd.Series(state.volume / bins.size, index=bins)
# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'
# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))
# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))
self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)
这是我的计算结果。我仍然不确定您的代码生成的答案是否正确,我认为您的行 volume_prices[_prices] += state.Volume / _prices.size
并未应用于 volume_prices 中的每条记录,但这里是基准测试。大约 9 倍的改进。
def vpOriginal():
Step = 0.00001
Precision = 5
_frame = getData()
_low = 85.0
_high = 116.4
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
time_prices = volume_prices.copy()
time_prices2 = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)
# Evenly distribute the bar's volume over its range
volume_prices[_prices] += state.Volume / _prices.size
# Increment time at price
time_prices[_prices] += 1
time_prices2 += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
# print(volume_prices.head(10))
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
return volume_poc, time_poc
def vpNoDF():
Step = 0.00001
Precision = 5
_frame = getData()
_low = 85.0
_high = 116.4
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around((state.High - state.Low) / Step , 0)
# Evenly distribute the bar's volume over its range
volume_prices.loc[state.Low:state.High] += state.Volume / _prices
# Increment time at price
time_prices.loc[state.Low:state.High] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
return volume_poc, time_poc
getData()
Out[8]:
Date Open High Low Close Volume Adj Close
0 2008-10-14 116.26 116.40 103.14 104.08 70749800 104.08
1 2008-10-13 104.55 110.53 101.02 110.26 54967000 110.26
2 2008-10-10 85.70 100.00 85.00 96.80 79260700 96.80
3 2008-10-09 93.35 95.80 86.60 88.74 57763700 88.74
4 2008-10-08 85.91 96.33 85.68 89.79 78847900 89.79
5 2008-10-07 100.48 101.50 88.95 89.16 67099000 89.16
6 2008-10-06 91.96 98.78 87.54 98.14 75264900 98.14
7 2008-10-03 104.00 106.50 94.65 97.07 81942800 97.07
8 2008-10-02 108.01 108.79 100.00 100.10 57477300 100.10
9 2008-10-01 111.92 112.36 107.39 109.12 46303000 109.12
vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)
vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)
%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
在每天的时间范围内实施此功能时,我的算法运行时间从 35 秒增加到 15 分钟。该算法批量检索每日历史记录并迭代数据帧的一个子集(从 t0 到 tX,其中 tX 是迭代的当前行)。它这样做是为了模拟在算法的实时操作过程中会发生什么。我知道有一些方法可以通过在帧计算之间利用内存来改进它,但我想知道是否有更多 pandas-ish 实现可以立即受益。
假设 self.Step
类似于 0.00001
而 self.Precision
类似于 5
;为了找到 poc,它们用于将 ohlc bar 信息合并为离散的步骤。 _frame
是整个数据帧的行的子集,_low
/_high
是相应的。以下代码块在整个 _frame
上执行,每次算法添加新行时(计算每日数据的年度时间范围时),整个 _frame
可能超过 250 行。我相信是 iterrows 导致了主要的放缓。数据框包含 high
、low
、open
、close
、volume
等列。我正在计算时间价格机会和数量控制点。
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
# Evenly distribute the bar's volume over its range
volume_prices[_prices] += state.volume / _prices.size
# Increment time at price
time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2)
您可以将此函数用作基础并对其进行调整:
def f(x): #function to find the POC price and volume
a = x['tradePrice'].value_counts().index[0]
b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
return pd.Series([a,b],['POC_Price','POC_Volume'])
无论如何,我已经设法将它减少到 2 分钟而不是 15 分钟 - 在每天的时间范围内。它在较短的时间范围内仍然很慢(在 2 年期间每小时 10 分钟,股票精度为 2)。使用 DataFrame 与使用 Series 相比要慢得多。我希望得到更多,但我不知道除了以下解决方案之外我还能做什么:
# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
# Include 1 step above high as np.arange does not
# include the upper limit by default
state = frame.iloc[-min(bars + 1, frame.index.size)]
bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
return pd.Series(state.volume / bins.size, index=bins)
# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'
# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))
# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))
self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)
这是我的计算结果。我仍然不确定您的代码生成的答案是否正确,我认为您的行 volume_prices[_prices] += state.Volume / _prices.size
并未应用于 volume_prices 中的每条记录,但这里是基准测试。大约 9 倍的改进。
def vpOriginal():
Step = 0.00001
Precision = 5
_frame = getData()
_low = 85.0
_high = 116.4
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
time_prices = volume_prices.copy()
time_prices2 = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)
# Evenly distribute the bar's volume over its range
volume_prices[_prices] += state.Volume / _prices.size
# Increment time at price
time_prices[_prices] += 1
time_prices2 += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
# print(volume_prices.head(10))
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
return volume_poc, time_poc
def vpNoDF():
Step = 0.00001
Precision = 5
_frame = getData()
_low = 85.0
_high = 116.4
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around((state.High - state.Low) / Step , 0)
# Evenly distribute the bar's volume over its range
volume_prices.loc[state.Low:state.High] += state.Volume / _prices
# Increment time at price
time_prices.loc[state.Low:state.High] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
return volume_poc, time_poc
getData()
Out[8]:
Date Open High Low Close Volume Adj Close
0 2008-10-14 116.26 116.40 103.14 104.08 70749800 104.08
1 2008-10-13 104.55 110.53 101.02 110.26 54967000 110.26
2 2008-10-10 85.70 100.00 85.00 96.80 79260700 96.80
3 2008-10-09 93.35 95.80 86.60 88.74 57763700 88.74
4 2008-10-08 85.91 96.33 85.68 89.79 78847900 89.79
5 2008-10-07 100.48 101.50 88.95 89.16 67099000 89.16
6 2008-10-06 91.96 98.78 87.54 98.14 75264900 98.14
7 2008-10-03 104.00 106.50 94.65 97.07 81942800 97.07
8 2008-10-02 108.01 108.79 100.00 100.10 57477300 100.10
9 2008-10-01 111.92 112.36 107.39 109.12 46303000 109.12
vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)
vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)
%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)