pandas: Rolling correlation with fixed patch for pattern-matching
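Happy New Year.
I'm looking for a way to compute, with pandas, the correlation between a rolling window and a fixed window ('patch'). The final objective is to do pattern matching.
From what I read in the docs, and hopefully I'm missing something, corr() and corrwith() do not allow you to lock one of the Series/DataFrames.
The best lame solution I have come up with so far is listed below. When it is run on 50K rows with a patch of 30 samples, the processing time goes into Ctrl+C range.
I would very much appreciate your suggestions and alternatives. Thank you.
Please run the code below and it will be very clear what I am trying to do: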
import numpy as np
import pandas as pd
from pandas import DataFrame

# Create test DataFrame df and a patch to be found.
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

n = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=n, freq='5min')
patch = DataFrame(np.arange(n), columns=['a'], index=rng)

print()
print(' *** Start corr example ***')
# To avoid the automatic alignment between df and patch,
# I need to reset the index.
patch.reset_index(inplace=True, drop=True)
# Cannot do:
# df.reset_index(inplace=True, drop=True)

df['corr'] = np.nan
corr_col = df.columns.get_loc('corr')

for i in range(df.shape[0]):
    window = df.iloc[i : i+patch.shape[0]]
    # If the slice has only two rows, I have a line between two points.
    # When I corr with two points in patch, I start getting
    # misleading values like 1 or -1.
    if window.shape[0] != patch.shape[0]:
        break
    else:
        # I need to reset_index for the window,
        # which is less efficient than doing it outside the
        # for loop, where the patch has its reset_index done.
        # If I did the df.reset_index up there,
        # I would still have automatic realignment, but
        # by index.
        window = window.reset_index(drop=True)

        # On top of the obvious inefficiency
        # of this method, I cannot just corrwith()
        # between specific columns in the dataframe;
        # corrwith() runs for all.
        # Alternatively I could create a new DataFrame
        # only with the needed columns:
        #     df_col = DataFrame(df.a)
        #     patch_col = DataFrame(patch.a)
        # Alternatively I could join the patch to
        # the df and shift it.
        corr = window.corrwith(patch)

        print()
        print('===========================')
        print('window:')
        print(window)
        print('---------------------------')
        print('patch:')
        print(patch)
        print('---------------------------')
        print('Corr for this window')
        print(corr)
        print('============================')

        # Assign by position; chained df['corr'][i] = ... is unreliable.
        df.iloc[i, corr_col] = corr.a

print()
print(' *** End corr example ***')
print(" Please inspect var 'df'")
print()
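Clearly, the heavy use of reset_index is a signal that we are fighting with Pandas' indexing and automatic alignment. Oh, how much easier things would be if we could just forget about the index!
Indeed, that is what NumPy is for. (Generally speaking, use Pandas when you need alignment or grouping by index, and use NumPy when doing computations on N-dimensional arrays.)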
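To make the alignment issue concrete, here is a minimal sketch (the two small Series are made up for illustration). Series.corr pairs values by index label, so two Series with disjoint labels have nothing to correlate, while dropping down to NumPy arrays pairs the values by position:

```python
import numpy as np
import pandas as pd

# Same values, but the index labels do not overlap.
a = pd.Series([1.0, 2.0, 3.0, 5.0], index=[0, 1, 2, 3])
b = pd.Series([1.0, 2.0, 3.0, 5.0], index=[10, 11, 12, 13])

# pandas aligns on labels first; no common labels means the result is NaN.
aligned = a.corr(b)

# NumPy ignores labels and pairs elements by position.
positional = np.corrcoef(a.to_numpy(), b.to_numpy())[0, 1]

print(aligned, positional)
```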
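Using NumPy will also make the computation much faster, because we can remove the for-loop and handle all the calculations done inside it as one calculation on a NumPy array full of rolling windows.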
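We can look inside pandas/core/frame.py's DataFrame.corrwith to see how the calculation is done, and then translate it into corresponding code on NumPy arrays. Some adjustments are needed, since we want to do the calculation on a whole array full of windows instead of just one window at a time, while keeping the patch constant. (Note: the Pandas corrwith method handles NaNs. To keep the code a bit simpler, I have assumed there are no NaNs in the inputs.)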
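For reference, the quantity computed for each window is the Pearson correlation coefficient. Here is a minimal sketch for a single window, assuming no NaNs. The helper name pearson is my own, used only for illustration, and the result is checked against np.corrcoef:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length 1-D arrays (assumes no NaNs)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd = x - x.mean()  # demeaned window
    yd = y - y.mean()  # demeaned patch
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

rng = np.random.default_rng(0)
window = rng.random(4)
patch = np.arange(4)

r = pearson(window, patch)
print(r, np.corrcoef(window, patch)[0, 1])
```

The vectorized version does the same arithmetic, only with the demeaning and the sums taken along axis 1 of a 2-D array of windows.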
import numpy as np
import pandas as pd
from pandas import DataFrame
import numpy.lib.stride_tricks as stride

np.random.seed(1)
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

m = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=m, freq='5min')
patch = DataFrame(np.arange(m), columns=['a'], index=rng)

def orig(df, patch):
    patch.reset_index(inplace=True, drop=True)
    df['corr'] = np.nan
    corr_col = df.columns.get_loc('corr')
    for i in range(df.shape[0]):
        window = df.iloc[i : i+patch.shape[0]]
        if window.shape[0] != patch.shape[0]:
            break
        else:
            # Drop the time index so corrwith pairs rows by position.
            window = window.reset_index(drop=True)
            corr = window.corrwith(patch)
            df.iloc[i, corr_col] = corr.a
    return df

def using_numpy(df, patch):
    left = np.ascontiguousarray(df['a'].values)
    right = patch['a'].values
    # Take the sizes from the inputs rather than from the globals n and m.
    n, m = len(left), len(right)
    itemsize = left.itemsize
    # Row i of the strided array is a view onto left[i : i+m]; nothing is copied.
    left = stride.as_strided(left, shape=(n-m+1, m), strides=(itemsize, itemsize))
    ldem = left - left.mean(axis=1)[:, None]
    rdem = right - right.mean()
    num = (ldem * rdem).sum(axis=1)
    dom = (m - 1) * np.sqrt(left.var(axis=1, ddof=1) * right.var(ddof=1))
    correl = num/dom
    # df.ix has been removed from pandas, so fill the leading rows by position.
    df['corr'] = np.nan
    df.iloc[:len(correl), df.columns.get_loc('corr')] = correl
    return df

expected = orig(df.copy(), patch.copy())
result = using_numpy(df.copy(), patch.copy())

print(expected)
print(result)
This confirms that the values generated by orig and using_numpy are the same:
assert np.allclose(expected['corr'].dropna(), result['corr'].dropna())
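Technical note:
To create the array full of rolling windows in a memory-friendly way, I used a striding trick (numpy.lib.stride_tricks.as_strided), which builds the windows as a view onto the original data instead of copying it.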
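As a side note, as_strided does no bounds checking, so a wrong shape or stride can read past the end of the buffer. Newer NumPy versions (1.20 and later) provide sliding_window_view, which derives the shape itself and returns a read-only view. A small sketch comparing the two on a toy array (the array a and the window length here are illustrative only):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

a = np.arange(6, dtype=float)
m = 4  # window length

# Manual striding: each row starts one element after the previous row.
manual = as_strided(a, shape=(a.size - m + 1, m),
                    strides=(a.itemsize, a.itemsize))

# Same windows without hand-written shape and strides (NumPy >= 1.20).
windows = sliding_window_view(a, m)

print(windows)
```

Both arrays are views that share memory with a, so even with many rows and a long patch no extra copy of the data is made until the arithmetic on the windows starts.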
Here is a benchmark, using n, m = 1000, 4 (lots of rows and a tiny patch, which generates lots of windows):
In [77]: %timeit orig(df.copy(), patch.copy())
1 loops, best of 3: 3.56 s per loop
In [78]: %timeit using_numpy(df.copy(), patch.copy())
1000 loops, best of 3: 1.35 ms per loop
-- a 2600x speedup.
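To tie this back to the stated goal of pattern matching, here is a hedged sketch of how the rolling correlation might be used to locate the best-matching position. The function name rolling_corr and the synthetic data are assumptions made for illustration. It requires NumPy 1.20 or later for sliding_window_view, and it assumes no NaNs and no constant windows (which would give a zero denominator):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_corr(series, patch):
    """Pearson correlation of every length-len(patch) window with patch."""
    x = np.asarray(series, dtype=float)
    p = np.asarray(patch, dtype=float)
    windows = sliding_window_view(x, p.size)            # shape (n-m+1, m)
    w_dem = windows - windows.mean(axis=1, keepdims=True)
    p_dem = p - p.mean()
    num = (w_dem * p_dem).sum(axis=1)
    den = np.sqrt((w_dem ** 2).sum(axis=1) * (p_dem ** 2).sum())
    return num / den

rng = np.random.default_rng(0)
series = rng.random(50)
patch = np.array([0.0, 3.0, 1.0, 4.0])

# Plant a scaled and shifted copy of the patch at position 20.
series[20:24] = 2 * patch + 5

corr = rolling_corr(series, patch)
best = int(np.argmax(corr))  # start position of the best-matching window
print(best, corr[best])
```

Because correlation is invariant to scaling and shifting, the planted copy is found even though its values differ from the patch. Whether that invariance is desirable depends on the kind of pattern being matched.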