Correlation matrix ignoring NaN

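I am using MATLAB and I have a 60x882 matrix for which I need to compute the pairwise correlations between columns. However, I want to ignore all columns that have one or more NaN values (i.e., the result for any pair of columns in which at least one entry is NaN should be NaN).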

This is my code so far:

for i=1:size(auxret,2)
    for j=1:size(auxret,2)
        rho(i,j)=corr(auxret(:,i),auxret(:,j));
    end
end

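But this is extremely inefficient. I considered using the function:

corr(auxret, 'rows','pairwise');

but it does not produce the same result (it ignores the NaNs but still computes the correlation, so unless all but one of a column's entries are NaN, it still gives an output).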

Any suggestions for improving efficiency?

To get the same output as your code while using corr(auxret, 'rows','pairwise'), do the following:

auxret(:,any(isnan(auxret))) = NaN;
r = corr(auxret, 'rows','pairwise');
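As a quick sanity check, the masking trick can be compared against the loop from the question on a tiny input. This is only a sketch: the sample matrix and the names masked, same_nans and max_diff are made up for illustration, and corr requires the Statistics and Machine Learning Toolbox.

```matlab
% Small sample matrix; column 3 contains one NaN
auxret = [8  2  3
          3  5  NaN
          7 10  3
          7  4  6
          2  6  7];

% Blank out every column that contains at least one NaN
masked = auxret;
masked(:, any(isnan(masked), 1)) = NaN;

% Pairwise deletion now finds no usable rows for column 3
r = corr(masked, 'rows', 'pairwise');

% Reference: the column-by-column loop from the question
ncols = size(auxret, 2);
rho = nan(ncols);
for i = 1:ncols
    for j = 1:ncols
        rho(i, j) = corr(auxret(:, i), auxret(:, j));
    end
end

% Compare the NaN pattern and the remaining numeric entries
same_nans = isequal(isnan(r), isnan(rho));
ok = ~isnan(rho);
max_diff = max(abs(r(ok) - rho(ok)));
```

If the two approaches agree, same_nans should be true and max_diff should be at the level of floating-point round-off.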

This would be an efficient approach, especially when dealing with input data that contains NaNs:

%// Get mask of invalid columns and thus extract columns without any NaN
mask = any(isnan(auxret),1);
A = auxret(:,~mask);

%// Use correlation formula to get correlation outputs for valid columns
n = size(A,1);
sum_cols = sum(A,1);
sumsq_sqcolsum = n*sum(A.^2,1) - sum_cols.^2;

val1 = n.*(A.'*A) - bsxfun(@times,sum_cols.',sum_cols);      %//'
val2 = sqrt(bsxfun(@times,sumsq_sqcolsum.',sumsq_sqcolsum)); %//'
valid_outvals = val1./val2;

%// Setup output array and store the valid outputs in it
ncols = size(auxret,2);
valid_idx = find(~mask);
out = nan(ncols);
out(valid_idx,valid_idx) = valid_outvals;

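Basically, as a pre-processing step, it removes all columns that have one or more NaNs and computes the correlation outputs for the rest. Then we initialize an output array of NaNs with the appropriate size and put the valid outputs back into it at the appropriate places.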
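For reference, the quantity the code evaluates for every pair of valid columns at once appears to be the usual single-pass form of the Pearson correlation coefficient. Writing x and y for two columns of A with n rows (these symbols are introduced here for illustration and do not appear in the code):

```latex
r_{xy} = \frac{n \sum_{k=1}^{n} x_k y_k - \sum_{k=1}^{n} x_k \sum_{k=1}^{n} y_k}
              {\sqrt{\left( n \sum_{k=1}^{n} x_k^2 - \Big( \sum_{k=1}^{n} x_k \Big)^2 \right)
                     \left( n \sum_{k=1}^{n} y_k^2 - \Big( \sum_{k=1}^{n} y_k \Big)^2 \right)}}
```

In the code, val1 is the numerator (A.'*A supplies the sums of products for all column pairs), sumsq_sqcolsum holds the per-column bracketed terms, and val2 is the square root of their pairwise products.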


Benchmarking

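Whether you stay with the loopy approach or use corr(auxret, 'rows','pairwise'), the results seem to be valid. However, there is a big catch: even a single NaN in any column degrades performance considerably. The slowdown is huge for the original loopy approach and still large for the rows + pairwise option, as the benchmark results that follow show.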

Benchmarking code

nrows = 60;
ncols = 882;
percent_nans = 1; %// decides the percentage of NaNs in input

auxret = rand(nrows,ncols);
auxret(randperm(numel(auxret),round((percent_nans/100)*numel(auxret))))=nan;

disp('------------------------------- With Proposed Approach')
tic
%// Solution code from earlier
toc

disp('------------------------------- With ROWS + PAIRWISE Approach')
tic
auxret(:,any(isnan(auxret))) = NaN;
out1 = corr(auxret, 'rows','pairwise');
toc

disp('------------------------------- With Original Loopy Approach')
tic
out2 = zeros(size(auxret,2));
for i=1:size(auxret,2)
    for j=1:size(auxret,2)
        out2(i,j)=corr(auxret(:,i),auxret(:,j));
    end
end
toc
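If a more repeatable timing than a single tic/toc pair is wanted, one option is to wrap the proposed approach in a function and hand it to timeit. The sketch below assumes MATLAB R2016b or later (so that a script may end with a local function); the helper name corr_skip_nan_cols and the variables nan_count and t_proposed are hypothetical and not part of the original answer.

```matlab
% Build a test input the same way as the benchmark above
nrows = 60;
ncols = 882;
percent_nans = 1;
auxret = rand(nrows, ncols);
nan_count = round((percent_nans/100) * numel(auxret));
auxret(randperm(numel(auxret), nan_count)) = NaN;

% timeit calls the handle several times and returns a typical run time in seconds
t_proposed = timeit(@() corr_skip_nan_cols(auxret));
fprintf('Proposed approach: %.6f seconds\n', t_proposed);

function out = corr_skip_nan_cols(X)
% Hypothetical helper: same steps as the solution code, wrapped in a function.
mask = any(isnan(X), 1);          % columns containing at least one NaN
A = X(:, ~mask);                  % keep only NaN-free columns
n = size(A, 1);
sum_cols = sum(A, 1);
var_term = n*sum(A.^2, 1) - sum_cols.^2;
num = n.*(A.'*A) - bsxfun(@times, sum_cols.', sum_cols);
den = sqrt(bsxfun(@times, var_term.', var_term));
out = nan(size(X, 2));            % NaN everywhere by default
out(~mask, ~mask) = num ./ den;   % fill in pairs of valid columns
end
```

The same pattern works for the other two approaches by passing a different function handle to timeit.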

Thus, depending on the input data size and the percentage of NaNs, there are a few possible cases, and the corresponding runtimes were:

Case 1: Input is 6 x 88, percentage of NaNs is 10

------------------------------- With Proposed Approach
Elapsed time is 0.006371 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.052563 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.875620 seconds.

Case 2: Input is 6 x 88, percentage of NaNs is 1

------------------------------- With Proposed Approach
Elapsed time is 0.006303 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.049194 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.871369 seconds.

Case 3: Input is 6 x 88, percentage of NaNs is 0.001

------------------------------- With Proposed Approach
Elapsed time is 0.006738 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.025754 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.867647 seconds.

Case 4: Input is 60 x 882, percentage of NaNs is 10

------------------------------- With Proposed Approach
Elapsed time is 0.007766 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.479645 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

Case 5: Input is 60 x 882, percentage of NaNs is 1

------------------------------- With Proposed Approach
Elapsed time is 0.014144 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.324878 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

Case 6: Input is 60 x 882, percentage of NaNs is 0.001

------------------------------- With Proposed Approach
Elapsed time is 0.020410 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 1.830632 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

What you describe is the default behavior of corr, without any special options. For example,

auxret =  [8     2     3
           3     5     NaN
           7    10     3
           7     4     6
           2     6     7];

rho = corr(auxret)

gives

rho =

    1.0000   -0.1497       NaN
   -0.1497    1.0000       NaN
       NaN       NaN       NaN
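The default here corresponds to the 'rows','all' setting of corr, which uses every row and lets NaNs propagate. A minimal check of that equivalence, assuming the same sample matrix (the variable names rho_default, rho_all and same_result are illustrative only):

```matlab
% Same sample matrix as above; column 3 contains one NaN
auxret = [8  2  3
          3  5  NaN
          7 10  3
          7  4  6
          2  6  7];

rho_default = corr(auxret);                 % no options given
rho_all     = corr(auxret, 'rows', 'all');  % default spelled out explicitly

% isequaln treats NaNs in matching positions as equal
same_result = isequaln(rho_default, rho_all);
```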