Efficient calculation across a dictionary consisting of thousands of correlation matrices

Question

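Based on a large dataset of daily observations from 20 assets, I created a dictionary containing (rolling) correlation matrices. I use the date index as the dictionary key.

What I would like to do now (in an efficient manner) is compare all correlation matrices in the dictionary and save the results in a new matrix. The idea is to compare the correlation structures over time.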

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet


key_list = list(dict_corr.keys())

# Create empty result matrix
X = np.empty(shape=[len(key_list), len(key_list)])

for key1_index, key1 in enumerate(key_list):

    # Extract correlation matrix from dictionary
    corr1_temp = dict_corr[key1]

    # Transform correlation matrix into distance matrix
    dist1_temp = ((1 - corr1_temp) / 2.) ** .5

    # Extract hierarchical structure from distance matrix
    link1_temp = linkage(dist1_temp, 'single')

    for key2_index, key2 in enumerate(key_list):

        corr2_temp = dict_corr[key2]
        dist2_temp = ((1 - corr2_temp) / 2.) ** .5
        link2_temp = linkage(dist2_temp, 'single')

        # Compare hierarchical structure between the two correlation matrices -> results in 2x2 matrix
        temp = np.corrcoef(cophenet(link1_temp), cophenet(link2_temp))

        # Extract the correlation from the resulting 2x2 matrix
        X[key1_index, key2_index] = temp[1, 0]

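I am well aware that using two for loops is probably the least efficient way to do this.

So I would really appreciate any helpful comments on how to speed up the calculation!

Best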

Answer 1

You can take a look at itertools, and then put your code for computing the correlation inside a function (compute_corr) that is called in a single for loop:

import itertools
for key_1, key_2 in itertools.combinations(dict_corr, 2):
    correlation = compute_corr(key_1, key_2, dict_corr)
    # now store correlation in a list

If you care about the order, use itertools.permutations(dict_corr, 2) instead of combinations.
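To make the difference between the itertools helpers concrete, here is a small sketch. The three keys are hypothetical stand-ins for the date keys; itertools.product is included for comparison because it is the only one of the three that also pairs a key with itself.

```python
import itertools

keys = ["d1", "d2", "d3"]  # hypothetical stand-ins for the date keys

# Unordered pairs, no self-pairs: (d1, d2) appears, (d2, d1) does not
unordered = list(itertools.combinations(keys, 2))

# Ordered pairs, no self-pairs: both (d1, d2) and (d2, d1) appear
ordered = list(itertools.permutations(keys, 2))

# Every ordered pair, including self-pairs such as (d1, d1)
all_pairs = list(itertools.product(keys, repeat=2))

print(len(unordered), len(ordered), len(all_pairs))  # prints: 3 6 9
```

For n keys these give n(n-1)/2, n(n-1) and n*n pairs respectively.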

Edit

Since you want all possible combinations of keys (including a key with itself), you should use itertools.product.

l_corr = [] #list to store all the output from the function
for key_1, key_2 in itertools.product(key_list, repeat= 2 ):
    l_corr.append(compute_corr(key_1, key_2, dict_corr))

Now l_corr will be long: len(key_list)*len(key_list) entries. You can convert this list into a matrix like this:

np.array(l_corr).reshape(len(key_list),len(key_list))
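If the comparison is symmetric (as a correlation between two structures is), roughly half of the calls can be skipped by filling only the upper triangle and mirroring it. The following is a minimal sketch of that idea, not part of the approach above: pairwise_matrix is a hypothetical helper name, and func stands for any symmetric two-argument comparison.

```python
import itertools
import numpy as np

def pairwise_matrix(keys, func):
    """Fill an n x n matrix with func(k1, k2), assuming func is symmetric."""
    keys = list(keys)
    n = len(keys)
    out = np.empty((n, n))
    # combinations_with_replacement yields each index pair with i <= j once,
    # including the diagonal (i == j)
    indexed = enumerate(keys)
    for (i, k1), (j, k2) in itertools.combinations_with_replacement(indexed, 2):
        out[i, j] = out[j, i] = func(k1, k2)
    return out
```

With the dummy function used in this answer, pairwise_matrix([1, 2, 3, 4, 5], lambda a, b: a * b) produces the same 5 x 5 multiplication table as the reshape, using 15 calls instead of 25.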

Dummy example:

def compute_corr(key_1, key_2, dict_corr):
    return key_1 * key_2 #dummy result from the function

dict_corr={1:"a",2:"b",3:"c",4:"d",5:"f"}
key_list = dict_corr.keys()

l_corr = []
for key_1, key_2 in itertools.product(key_list, repeat= 2 ):
    print(key_1, key_2)
    l_corr.append(compute_corr(key_1, key_2, dict_corr))

The loop prints every pair of keys, from 1 1 through 5 5 (25 pairs in total).

Creating the final matrix:

np.array(l_corr).reshape(len(key_list),len(key_list))

array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25]])
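For the actual problem, most of the cost probably comes from recomputing the linkage for the same matrix many times, not from the loop itself. Here is a rough sketch of one way to avoid that: compute the cophenetic distances once per key, stack them, and let a single np.corrcoef call produce the whole matrix. The function names cophenetic_vector and compare_structures are hypothetical. The sketch also assumes the intent is to cluster on the distance matrix itself, so it converts it to condensed form with squareform before calling linkage (a 2-D array passed to linkage is otherwise treated as observation vectors), which may give different numbers than the code in the question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_vector(corr):
    """Condensed cophenetic distances for one correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    # Same transform as in the question; clip guards against tiny
    # negative values caused by floating-point error
    dist = np.sqrt(np.clip((1.0 - corr) / 2.0, 0.0, None))
    # Assumption: treat dist as a distance matrix, so pass it in condensed form
    condensed = squareform(dist, checks=False)
    return cophenet(linkage(condensed, method="single"))

def compare_structures(dict_corr):
    """Return the keys and the matrix of correlations between their structures."""
    keys = list(dict_corr)
    # One linkage per key instead of one per pair of keys
    vectors = np.vstack([cophenetic_vector(dict_corr[k]) for k in keys])
    # Rows are treated as variables, so the result is len(keys) x len(keys)
    return keys, np.corrcoef(vectors)
```

Calling keys, X = compare_structures(dict_corr) would then give X[i, j] as the correlation between the structures at keys[i] and keys[j]. With 20 assets each row has only 190 entries, so stacking a few thousand of them should be cheap, though I have not timed it on data of that size.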

Let me know if I missed anything. Hope this helps.

python

numpy

hierarchical-clustering

scipy

pandas