Dataframe fast way to search backward in columns while expanding
Here is my prices dataframe:
stock1 stock2
0 2.3 10.1
1 1.9 11.2
2 3.5 10.5
3 2.8 10.8
4 3.1 10.3
5 2.7 9.8
6 3.3 10.2
Here is the price supports dataframe I want to obtain:
stock1 stock2
0 NaN NaN
1 1.9 10.1
2 1.9 10.1
3 1.9 10.5
4 2.8 10.1
5 1.9 NaN
6 2.7 9.8
Let's focus on the first column of prices. The idea is to compute the lower peaks (local minima) this way:
prices1 = prices['stock1']
mask = (prices1.shift(1) > prices1) & (prices1.shift(-1) > prices1)
supports1 = prices1.where(mask, NaN)
supports1.iloc[0] = min(prices1[0],prices1[1])
supports1 = supports1.shift(1).ffill()
We get:
stock1
0 NaN
1 NaN
2 1.9
3 1.9
4 2.8
5 2.8
6 2.7
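Another rule is that, for each price, the support must be lower than the price. This fails in row 5, because 2.8 > 2.7. To correct it, we must look backward in this supports column to find the first value lower than the current price (if one exists, otherwise NaN). In this case, the correct value is 1.9.
I found 2 ways to solve the problem, but both need to iterate, and they become very slow as the dataframe grows. I would like something 10 times faster, hopefully 100 times.
Here is my code: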
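To make the rule concrete, here is a minimal sketch of the backward search for a single row. The helper name `first_lower_before` is hypothetical (it is not part of the code below), and the list is the intermediate stock1 column shown above.

```python
from math import nan, isnan

def first_lower_before(supports, row, price):
    """Walk backward from row - 1 and return the first support strictly below price."""
    for k in range(row - 1, -1, -1):
        s = supports[k]
        if not isnan(s) and s < price:
            return s
    return nan  # no earlier support is low enough

# Intermediate stock1 supports from above, and the stock1 prices
supports1 = [nan, nan, 1.9, 1.9, 2.8, 2.8, 2.7]
prices1 = [2.3, 1.9, 3.5, 2.8, 3.1, 2.7, 3.3]

row = 5  # support 2.8 is above price 2.7, so it must be corrected
fixed = first_lower_before(supports1, row, prices1[row])
print(fixed)  # prints 1.9
```

Doing this scan for every offending row is what makes the naive approach slow on long columns.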
from pandas import DataFrame
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0
from numpy.random import uniform
from timeit import timeit

## Uncomment to build a larger random dataset
##rows = 5000
##cols = 10
##d={}
##for i in range(cols):
##    d['stock_{}'.format(i)] = 100*uniform(0.95,1.05,rows).cumprod()
##prices = DataFrame(d)
prices = DataFrame({'stock1': [2.3, 1.9, 3.5, 2.8, 3.1, 2.7, 3.3],
                    'stock2': [10.1, 11.2, 10.5, 10.8, 10.3, 9.8, 10.2]})
#----------------------------------------------------------------
def calc_supports1(prices):
    supports = DataFrame().reindex_like(prices)
    for stock in prices:
        prices1 = prices[stock]
        # Local minima: strictly lower than both neighbours
        mask = (prices1.shift(1) > prices1) & (prices1.shift(-1) > prices1)
        supports1 = prices1.where(mask, NaN)
        supports1.iloc[0] = min(prices1[0], prices1[1])
        supports1 = supports1.shift(1).ffill()
        sup = supports1.drop_duplicates()
        # Fix the rows where the support is above the price
        for i, v in prices1.loc[prices1 < supports1].items():
            mask = (sup.index < i) & (sup < v)
            sup2 = sup.values[mask.values]
            supports1.at[i] = sup2[-1] if len(sup2) > 0 else NaN
        supports[stock] = supports1
    return supports
#----------------------------------------------------------------
def calc_supports2(prices):
    supports = DataFrame().reindex_like(prices)
    for stock in prices:
        prices1 = prices[stock]
        # Stack of candidate supports, most recent at the front
        sup = [min(prices1[0], prices1[1])]
        supports1 = [NaN, sup[0]]
        for i in range(2, len(prices1)):
            # Drop the candidates that are above the current price
            while len(sup) > 0 and prices1[i] < sup[0]:
                sup.pop(0)
            # The previous price is a local minimum: new candidate
            if prices1[i-1] < prices1[i] and prices1[i-1] < prices1[i-2]:
                sup.insert(0, prices1[i-1])
            supports1.append(sup[0] if len(sup) > 0 else NaN)
        supports[stock] = supports1
    return supports
#----------------------------------------------------------------
print('fun1', timeit('calc_supports1(prices)',
                     setup='from __main__ import calc_supports1, prices', number=1))
print('fun2', timeit('calc_supports2(prices)',
                     setup='from __main__ import calc_supports2, prices', number=1))
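How can I speed this up?
There are several performance issues in this code.
- Python loops are slow (with the default interpreter, CPython). It is better to use something like Cython or Numba for this kind of computation.
- Manual indexing into Pandas objects is slow. It is better to use Python lists here, or even to avoid indexing altogether in the hot part of the code.
- Inserting or removing elements at the beginning of a list can be much slower, since every remaining element has to be shifted. The Python documentation recommends adding and removing elements at the end to implement a stack.
Here is the corrected code: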
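The last two points can be checked with a small micro-benchmark. This is only a sketch: the size `n` and the four function names are arbitrary choices made for illustration, and the timings will depend on the machine and on the pandas version.

```python
from timeit import timeit
import pandas as pd

n = 20_000
series = pd.Series(range(n), dtype=float)
values = list(series)

def sum_series():
    total = 0.0
    for i in range(n):
        total += series[i]   # label lookup through pandas on every access
    return total

def sum_list():
    total = 0.0
    for i in range(n):
        total += values[i]   # plain list indexing
    return total

def stack_at_front():
    stack = []
    for k in range(n):
        stack.insert(0, k)   # shifts every existing element
    while stack:
        stack.pop(0)         # shifts the remaining elements again
    return len(stack)

def stack_at_end():
    stack = []
    for k in range(n):
        stack.append(k)      # amortized constant time
    while stack:
        stack.pop()          # constant time
    return len(stack)

for fn in (sum_series, sum_list, stack_at_front, stack_at_end):
    print(fn.__name__, timeit(fn, number=1))
```

In the question's code the stack usually stays short, so the cost of indexing the Series inside the loop is likely the larger of the two effects.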
def calc_supports3(prices):
    supports = DataFrame().reindex_like(prices)
    for stock in prices:
        prices1 = list(prices[stock])  # plain list instead of a Series
        # Stack of candidate supports, most recent at the end
        sup = [min(prices1[0], prices1[1])]
        supports1 = [NaN, sup[-1]]
        # Sliding window
        prices1_im2 = NaN
        prices1_im1 = prices1[0]
        prices1_im0 = prices1[1]
        for i in range(2, len(prices1)):
            prices1_im2, prices1_im1, prices1_im0 = prices1_im1, prices1_im0, prices1[i]
            # Drop the candidates that are above the current price
            while len(sup) > 0 and prices1_im0 < sup[-1]:
                sup.pop()
            # The previous price is a local minimum: new candidate
            if prices1_im1 < prices1_im0 and prices1_im1 < prices1_im2:
                sup.append(prices1_im1)
            supports1.append(sup[-1] if len(sup) > 0 else NaN)
        supports[stock] = supports1
    return supports
Here are the performance results on your small dataset, on my machine:
fun1 0.006461 s
fun2 0.000901 s
fun3 0.000648 s (40% faster than fun2)
Here are the performance results on a randomly generated dataset with 50 000 rows:
fun1 3.916947 s
fun2 2.064891 s
fun3 0.034465 s (60 times faster than fun2)
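calc_supports3 still runs its inner loop in the interpreter. As a sketch of the first point above (compiling the loop), the same stack logic might be written for Numba as follows. It has not been timed here. The names `supports_1d` and `calc_supports_numba` are hypothetical, each column is assumed to be float64 without missing values, and the try/except fallback only exists so that the code still runs, uncompiled, when Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):  # fallback: run as plain Python
        return func

@njit
def supports_1d(prices):
    n = prices.shape[0]
    out = np.full(n, np.nan)
    if n < 2:
        return out
    # Preallocated stack of candidate supports; the top is stack[top - 1]
    stack = np.empty(n, dtype=np.float64)
    stack[0] = min(prices[0], prices[1])
    top = 1
    out[1] = stack[0]
    for i in range(2, n):
        cur = prices[i]
        # Drop the candidates that are above the current price
        while top > 0 and cur < stack[top - 1]:
            top -= 1
        # The previous price is a local minimum: new candidate
        prev = prices[i - 1]
        if prev < cur and prev < prices[i - 2]:
            stack[top] = prev
            top += 1
        if top > 0:
            out[i] = stack[top - 1]
    return out

def calc_supports_numba(prices):
    # One compiled pass per column
    cols = {c: supports_1d(prices[c].to_numpy(dtype=np.float64)) for c in prices}
    return pd.DataFrame(cols, index=prices.index)

prices = pd.DataFrame({'stock1': [2.3, 1.9, 3.5, 2.8, 3.1, 2.7, 3.3],
                       'stock2': [10.1, 11.2, 10.5, 10.8, 10.3, 9.8, 10.2]})
print(calc_supports_numba(prices))
```

On the small dataset this should reproduce the expected supports table from the question. The first call includes the compilation time, so any benchmark should time a second call.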