Can I increase the computation speed of correlation analysis between vast time-series data?

Question

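First, I will describe my goal and the code I use to achieve it.

VALUE is a 3-d numpy array representing the time variation of a 2-d area (e.g. VALUE[:1000, 2, 3] is the list of values of grid cell (X = 3, Y = 2) from 0 to 1000 s).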

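In my real work, VALUE has shape (2812, 75, 90). (P.S. 2812 is the total number of hours in 4 months.)
Certain points, which I call SELECT, are points of interest; for each of them I run a correlation analysis against every grid cell of the area.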

SELECT is a pandas dataframe including each interesting point's X and Y
cov is a 3-d array used as a count matrix that records the correlation level of each SELECT point with every grid point.

Setting cut-off Pearson coefficient rc = 0.75,
for SELECT point t,
If r(i,j) > rc ==> cov[t,i,j] = 1, else cov[t,i,j] = 0

Here is my code, but it is a bit slow. I think some parts of the process could be improved:

    import timeit
    import numpy as np
    import pandas as pd

    start = timeit.default_timer()
    # SELECT is a pandas dataframe including each interesting point's X and Y
    cov = np.zeros((len(SELECT), VALUE.shape[1], VALUE.shape[2]))
    for t in range(len(SELECT)):
        select_grid = pd.DataFrame(VALUE[:, SELECT.Y.iloc[t], SELECT.X.iloc[t]])
        for i in range(VALUE.shape[1]):
            for j in range(VALUE.shape[2]):
                data_grid = pd.DataFrame(VALUE[:, i, j])
                # Using corr to compute the correlation r
                r_sg = select_grid[0].corr(data_grid[0])
                if r_sg > 0.75:
                    cov[t, i, j] = 1
    end = timeit.default_timer()
    print(end - start)

Answer 1

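Your job is time-consuming: about one second for each sample in SELECT.

Vectorization will not give you a big boost, because the time-consuming corr function sits in the inner loop.

However, you can have lighter code; pandas is not strictly necessary here. For example: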

import numpy as np
import pandas as pd
from numpy.random import random, randint

VALUE=random((2812,5,5))
select=pd.DataFrame(randint(0,5,(10,2)))
....
for (x,y) in select.values:
....
     r=np.corrcoef(VALUE[:,x,y],VALUE[:,i,j])[0,1]
....

The [0,1] index selects r here, because corrcoef returns a 2x2 array.
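
With the elided parts of the loop filled in, a minimal sketch might look like the following. The small random VALUE, the seed, the function name count_matrix_corrcoef and the index order of the point pairs are assumptions made for illustration, not the asker's real data or code.

```python
import numpy as np

def count_matrix_corrcoef(value, points, rc=0.75):
    """Flag grid cells whose Pearson r with each selected point exceeds rc.

    value  : array of shape (time, n1, n2)
    points : sequence of (a, b) index pairs into the two spatial axes
    """
    _, n1, n2 = value.shape
    cov = np.zeros((len(points), n1, n2))
    for t, (a, b) in enumerate(points):
        ref = value[:, a, b]  # time series of the selected point
        for i in range(n1):
            for j in range(n2):
                # corrcoef returns a 2x2 matrix; [0, 1] is r(ref, cell)
                r = np.corrcoef(ref, value[:, i, j])[0, 1]
                if r > rc:
                    cov[t, i, j] = 1
    return cov

# Hypothetical toy data, much smaller than the real (2812, 75, 90) array
rng = np.random.default_rng(0)
VALUE = rng.random((200, 5, 5))
points = rng.integers(0, 5, size=(10, 2))
cov = count_matrix_corrcoef(VALUE, points)
```

Each selected point correlates perfectly with itself, so cov[t, a, b] is always 1 for its own cell; that is a quick sanity check on the result.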

The first optimization you can make is to use numpy arrays instead of dataframes, which gives a 2x gain on the corr computation.

DFexample = pd.DataFrame(VALUE[:,0,:])

In [19]: %timeit np.corrcoef(VALUE[:,0,0],VALUE[:,0,1]) 
1000 loops, best of 3: 556 µs per loop

In [20]: %timeit DFexample[0].corr(DFexample[1])
1000 loops, best of 3: 1.09 ms per loop

Another is to pre-compute the means and stds, since r(x,y) = (<xy>-<x><y>)/σx/σy, which gives a further 3x gain:

In [24]: s=VALUE.std(axis=0)  # 1 second

In [25]: m=VALUE.mean(axis=0) # 2 second

In [26]: %timeit ((VALUE[:,0,0]*VALUE[:,0,1]).mean() -m[0,1]*m[0,0])/s[0,0]/s[0,1]
10000 loops, best of 3: 172 µs per loop

In [31]: allclose(((VALUE[:,0,0]*VALUE[:,0,1]).mean() -m[0,1]*m[0,0])/s[0,0]/s[0,1],\
DFexample[0].corr(DFexample[1]))
Out[31]: True

So you can gain at least a factor of 6 overall.
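
Applied to the whole grid, the pre-computed mean and std could be used roughly as in this sketch. It relies on the identity r(x,y) = (<xy>-<x><y>)/σx/σy with population statistics (numpy's default ddof=0); the toy data, the seed, the function name count_matrix_precomputed and the index order are assumptions for illustration.

```python
import numpy as np

def count_matrix_precomputed(value, points, rc=0.75):
    """Same thresholding as before, using per-cell mean/std computed once."""
    m = value.mean(axis=0)  # per-cell mean
    s = value.std(axis=0)   # per-cell population std (ddof=0)
    _, n1, n2 = value.shape
    cov = np.zeros((len(points), n1, n2))
    for t, (a, b) in enumerate(points):
        ref = value[:, a, b]
        for i in range(n1):
            for j in range(n2):
                # r = (<xy> - <x><y>) / (sigma_x * sigma_y)
                xy_mean = (ref * value[:, i, j]).mean()
                r = (xy_mean - m[a, b] * m[i, j]) / (s[a, b] * s[i, j])
                if r > rc:
                    cov[t, i, j] = 1
    return cov

# Hypothetical toy data
rng = np.random.default_rng(0)
VALUE = rng.random((200, 5, 5))
points = rng.integers(0, 5, size=(10, 2))
cov = count_matrix_precomputed(VALUE, points)

# Cross-check the identity against np.corrcoef for one pair of cells
m, s = VALUE.mean(axis=0), VALUE.std(axis=0)
r_ref = np.corrcoef(VALUE[:, 0, 0], VALUE[:, 0, 1])[0, 1]
r_fast = ((VALUE[:, 0, 0] * VALUE[:, 0, 1]).mean() - m[0, 0] * m[0, 1]) / (s[0, 0] * s[0, 1])
assert np.isclose(r_ref, r_fast)
```

A grid cell whose series is constant has σ = 0, so the division would produce nan or inf; real data with such cells would need them masked out first.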

python

numpy

scipy

correlation

pandas