Efficient pairwise correlation for two matrices of features
In Python I need to find the pairwise correlation between all features in a matrix A and all features in a matrix B. In particular, I am interested in finding the strongest Pearson correlation that a given feature in A has across all features in B. I do not care whether the strongest correlation is positive or negative.

I have done an inefficient implementation below using two loops and scipy. However, I would like to compute this efficiently using np.corrcoef or another similar method. Matrix A has shape 40000x400 and B has shape 40000x1440. My attempt at doing this efficiently can be seen below as the method find_max_absolute_corr(A,B). However, it fails with the following error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
import numpy as np
from scipy.stats import pearsonr


def find_max_absolute_corr(A, B):
    """ Finds for each feature in `A` the highest Pearson
        correlation across all features in `B`. """
    max_corr_A = np.zeros((A.shape[1]))

    for A_col in range(A.shape[1]):
        print "Calculating {}/{}.".format(A_col+1, A.shape[1])
        metric = A[:,A_col]

        pearson = np.corrcoef(B, metric, rowvar=0)

        # takes negative correlations into account as well
        min_p = min(pearson)
        max_p = max(pearson)
        max_corr_A[A_col] = max_absolute(min_p, max_p)

    return max_corr_A


def max_absolute(min_p, max_p):
    if np.isnan(min_p) or np.isnan(max_p):
        raise ValueError("NaN correlation.")
    if abs(max_p) > abs(min_p):
        return max_p
    else:
        return min_p


if __name__ == '__main__':

    A = np.array(
        [[10, 8.04, 9.14, 7.46],
         [8, 6.95, 8.14, 6.77],
         [13, 7.58, 8.74, 12.74],
         [9, 8.81, 8.77, 7.11],
         [11, 8.33, 9.26, 7.81]])

    B = np.array(
        [[-14, -9.96, 8.10, 8.84, 8, 7.04],
         [-6, -7.24, 6.13, 6.08, 5, 5.25],
         [-4, -4.26, 3.10, 5.39, 8, 5.56],
         [-12, -10.84, 9.13, 8.15, 5, 7.91],
         [-7, -4.82, 7.26, 6.42, 8, 6.89]])

    # simple, inefficient method
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr,_ = pearsonr(A[:,A_col], B[:,B_col])
            high_corr = max_absolute(high_corr, corr)
        print high_corr

    # -0.161314601631
    # 0.956781516149
    # 0.621071009239
    # -0.421539304112

    # efficient method
    max_corr_A = find_max_absolute_corr(A, B)
    print max_corr_A

    # [-0.161314601631,
    #  0.956781516149,
    #  0.621071009239,
    #  -0.421539304112]
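(As a side note on the np.corrcoef route asked about above: a minimal sketch, assuming both matrices fit in memory together, is to stack the columns of A and B, correlate everything at once, and slice out the cross block. The function name below is illustrative only, and this builds the full (nA+nB) x (nA+nB) correlation matrix, so it is more wasteful than the vectorized answer that follows.)

import numpy as np

def max_abs_corr_via_corrcoef(A, B):
    # Treat every column of A and of B as a variable (rowvar=False),
    # correlate them all at once, then slice out the A-vs-B block.
    full = np.corrcoef(np.hstack((A, B)), rowvar=False)   # (nA+nB, nA+nB)
    cross = full[:A.shape[1], A.shape[1]:]                # (nA, nB): A columns vs B columns
    # For each feature of A, keep the correlation with the largest magnitude.
    idx = np.nanargmax(np.abs(cross), axis=1)
    return cross[np.arange(A.shape[1]), idx]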
It seems scipy.stats.pearsonr follows the definition of the Pearson correlation coefficient formula, applied to column-wise pairs taken from A and B:
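(The formula image referenced at this point is not reproduced here; written out, the form that the code below implements is)

$$
r_{xy} \;=\; \frac{N\sum_i x_i y_i \;-\; \sum_i x_i \sum_i y_i}
{\sqrt{\Big(N\sum_i x_i^2 - \big(\sum_i x_i\big)^2\Big)\Big(N\sum_i y_i^2 - \big(\sum_i y_i\big)^2\Big)}}
$$

where x is a column of A, y is a column of B, and N is the number of rows.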
Based on that formula, you can easily vectorize, as the pairwise computations between the columns of A and B are independent of each other. Here is one vectorized solution using broadcasting -
# Get number of rows in either A or B
N = B.shape[0]
# Column-wise sums of A and B, as they are used in a few places
sA = A.sum(0)
sB = B.sum(0)
# Basically there are four parts in the formula. We would compute them one-by-one
p1 = N*np.einsum('ij,ik->kj',A,B)
p2 = sA*sB[:,None]
p3 = N*((B**2).sum(0)) - (sB**2)
p4 = N*((A**2).sum(0)) - (sA**2)
# Finally compute Pearson Correlation Coefficient as 2D array
pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
# Get the element corresponding to absolute argmax along the columns
out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
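The runtime test further down calls org_app and proposed_app, which are not defined in this excerpt. A minimal sketch of what they presumably wrap (the question's double loop and the vectorized code above, packaged as functions) is shown below; max_absolute is the helper from the question.

import numpy as np
from scipy.stats import pearsonr

def org_app(A, B):
    # Double loop from the question: pearsonr on every (A column, B column) pair,
    # keeping the correlation of largest magnitude per A column.
    out = np.zeros(A.shape[1])
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr, _ = pearsonr(A[:, A_col], B[:, B_col])
            high_corr = max_absolute(high_corr, corr)   # max_absolute from the question
        out[A_col] = high_corr
    return out

def proposed_app(A, B):
    # Vectorized computation from above, returning the absolute-max correlation
    # of each column of A against all columns of B.
    N = B.shape[0]
    sA = A.sum(0)
    sB = B.sum(0)
    p1 = N*np.einsum('ij,ik->kj', A, B)
    p2 = sA*sB[:, None]
    p3 = N*((B**2).sum(0)) - (sB**2)
    p4 = N*((A**2).sum(0)) - (sA**2)
    pcorr = (p1 - p2)/np.sqrt(p4*p3[:, None])
    return pcorr[np.nanargmax(np.abs(pcorr), axis=0), np.arange(pcorr.shape[1])]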
Sample run -
1) Inputs:
In [12]: A
Out[12]:
array([[ 10. , 8.04, 9.14, 7.46],
[ 8. , 6.95, 8.14, 6.77],
[ 13. , 7.58, 8.74, 12.74],
[ 9. , 8.81, 8.77, 7.11],
[ 11. , 8.33, 9.26, 7.81]])
In [13]: B
Out[13]:
array([[-14. , -9.96, 8.1 , 8.84, 8. , 7.04],
[ -6. , -7.24, 6.13, 6.08, 5. , 5.25],
[ -4. , -4.26, 3.1 , 5.39, 8. , 5.56],
[-12. , -10.84, 9.13, 8.15, 5. , 7.91],
[ -7. , -4.82, 7.26, 6.42, 8. , 6.89]])
2) Original loopy code run -
In [14]: high_corr_out = np.zeros(A.shape[1])
...: for A_col in range(A.shape[1]):
...: high_corr = 0
...: for B_col in range(B.shape[1]):
...: corr,_ = pearsonr(A[:,A_col], B[:,B_col])
...: high_corr = max_absolute(high_corr, corr)
...: high_corr_out[A_col] = high_corr
...:
In [15]: high_corr_out
Out[15]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779])
3) Proposed code run -
In [16]: N = B.shape[0]
...: sA = A.sum(0)
...: sB = B.sum(0)
...: p1 = N*np.einsum('ij,ik->kj',A,B)
...: p2 = sA*sB[:,None]
...: p3 = N*((B**2).sum(0)) - (sB**2)
...: p4 = N*((A**2).sum(0)) - (sA**2)
...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
...:
In [17]: pcorr # Pearson Correlation Coefficient array
Out[17]:
array([[ 0.41895565, -0.5910935 , -0.40465987, 0.5818286 ],
[ 0.66609445, -0.41950457, 0.02450215, 0.64028344],
[-0.64953314, 0.65669916, 0.30836196, -0.85127779],
[-0.41917583, 0.59043266, 0.40364532, -0.58144102],
[ 0.8067843 , 0.07947386, 0.74016181, 0.53165395],
[-0.1613146 , 0.95678152, 0.62107101, -0.4215393 ]])
In [18]: out # elements corresponding to absolute argmax along columns
Out[18]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779])
Runtime test -
In [36]: A = np.random.rand(4000,40)
In [37]: B = np.random.rand(4000,144)
In [38]: np.allclose(org_app(A,B),proposed_app(A,B))
Out[38]: True
In [39]: %timeit org_app(A,B) # Original approach
1 loops, best of 3: 1.35 s per loop
In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach
10 loops, best of 3: 39.1 ms per loop
Adding to the answer above, from personal experience I found
p1 = N*np.dot(B.T,A)
to work faster than
p1 = N*np.einsum('ij,ik->kj',A,B)
especially when A and B are large matrices.
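A quick sanity-check sketch (not part of the original answers) confirming that the BLAS-backed np.dot variant produces the same p1 term as the einsum call, so it can be dropped in directly; the array sizes below are arbitrary:

import numpy as np

A = np.random.rand(4000, 40)
B = np.random.rand(4000, 144)
N = B.shape[0]

p1_einsum = N * np.einsum('ij,ik->kj', A, B)   # shape: (B columns, A columns)
p1_dot = N * np.dot(B.T, A)                    # same sums via matrix multiplication
print(np.allclose(p1_einsum, p1_dot))          # expected: True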