Correlation matrix ignoring NaN
I am using MATLAB and I have a (60x882) matrix, and I need to compute the pairwise correlations between its columns. However, I want to ignore every column that contains one or more NaN values (i.e., the result for any pair of columns in which at least one entry is NaN should be NaN).
This is my code so far:
for i = 1:size(auxret,2)
    for j = 1:size(auxret,2)
        rho(i,j) = corr(auxret(:,i), auxret(:,j));
    end
end
But this is very inefficient. I considered using the function:
corr(auxret, 'rows','pairwise');
but it does not produce the same result (it ignores the NaNs but still computes a correlation, so unless every entry of a column except one is NaN it will still give an output).
Any suggestions for improving efficiency?
To get the same output as your code while still using corr(auxret, 'rows','pairwise'), do the following:
auxret(:,any(isnan(auxret))) = NaN;
r = corr(auxret, 'rows','pairwise');
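The same idea can be cross-checked in Python/NumPy (a sketch, not the author's code; the function name `corr_ignore_nan_cols` is made up for illustration): exclude every column that contains any NaN, correlate the rest, and fill the excluded rows/columns of the result with NaN — the same NaN pattern the two MATLAB lines above produce.

```python
import numpy as np

def corr_ignore_nan_cols(X):
    """Pairwise column correlations; any pair involving a column
    that contains at least one NaN gets NaN in the result."""
    mask = np.isnan(X).any(axis=0)        # columns with at least one NaN
    out = np.full((X.shape[1], X.shape[1]), np.nan)
    valid = ~mask
    if valid.any():
        # Correlate only the NaN-free columns, then scatter back
        out[np.ix_(valid, valid)] = np.corrcoef(X[:, valid], rowvar=False)
    return out
```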
This would be an efficient approach, especially when dealing with input data involving NaNs -
%// Get mask of invalid columns and thus extract columns without any NaN
mask = any(isnan(auxret),1);
A = auxret(:,~mask);
%// Use correlation formula to get correlation outputs for valid columns
n = size(A,1);
sum_cols = sum(A,1);
sumsq_sqcolsum = n*sum(A.^2,1) - sum_cols.^2;
val1 = n.*(A.'*A) - bsxfun(@times,sum_cols.',sum_cols); %//'
val2 = sqrt(bsxfun(@times,sumsq_sqcolsum.',sumsq_sqcolsum)); %//'
valid_outvals = val1./val2;
%// Setup output array and store the valid outputs in it
ncols = size(auxret,2);
valid_idx = find(~mask);
out = nan(ncols);
out(valid_idx,valid_idx) = valid_outvals;
Basically, as a pre-processing step, it removes all columns that have one or more NaNs and computes the correlation outputs for the remaining columns. Then we initialize an output array of NaNs of the appropriate size and put the valid outputs back into it at the appropriate places.
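The vectorized code above implements the expanded "computational" form of Pearson correlation, r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²)). A NumPy transcription of that formula (a sketch for verification, with `np.outer` playing the role of `bsxfun(@times, ...)`):

```python
import numpy as np

def corr_formula(A):
    """Column-wise Pearson correlation via the expanded formula:
    r = (n*sum(xy) - sum(x)*sum(y)) /
        sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))"""
    n = A.shape[0]
    s = A.sum(axis=0)                        # column sums (sum_cols)
    ss = n * (A ** 2).sum(axis=0) - s ** 2   # sumsq_sqcolsum
    num = n * (A.T @ A) - np.outer(s, s)     # val1
    den = np.sqrt(np.outer(ss, ss))          # val2
    return num / den
```

Agreement with the textbook definition can be checked against `np.corrcoef(A, rowvar=False)`.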
Benchmarking
The results seem to be valid whether you stay with the loopy approach or use the optional corr(auxret, 'rows','pairwise'). But there is one big issue here: even a single NaN in any column hurts performance badly, and the slowdown is huge for the original loopy approach and still sizable for the rows + pairwise option. Find out in the benchmark results next.
Benchmarking code
nrows = 60;
ncols = 882;
percent_nans = 1; %// decides the percentage of NaNs in input
auxret = rand(nrows,ncols);
auxret(randperm(numel(auxret),round((percent_nans/100)*numel(auxret))))=nan;
disp('------------------------------- With Proposed Approach')
tic
%// Solution code from earlier
toc
disp('------------------------------- With ROWS + PAIRWISE Approach')
tic
auxret(:,any(isnan(auxret))) = NaN;
out1 = corr(auxret, 'rows','pairwise');
toc
disp('------------------------------- With Original Loopy Approach')
tic
out2 = zeros(size(auxret,2));
for i = 1:size(auxret,2)
    for j = 1:size(auxret,2)
        out2(i,j) = corr(auxret(:,i), auxret(:,j));
    end
end
toc
So, depending on the input data size and the percentage of NaNs, there are a few possible cases, and the corresponding runtime results are listed next -
Case #1: Input is 6 x 88 with 10% NaNs
------------------------------- With Proposed Approach
Elapsed time is 0.006371 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.052563 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.875620 seconds.
Case #2: Input is 6 x 88 with 1% NaNs
------------------------------- With Proposed Approach
Elapsed time is 0.006303 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.049194 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.871369 seconds.
Case #3: Input is 6 x 88 with 0.001% NaNs
------------------------------- With Proposed Approach
Elapsed time is 0.006738 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.025754 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.867647 seconds.
Case #4: Input is 60 x 882 with 10% NaNs
------------------------------- With Proposed Approach
Elapsed time is 0.007766 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.479645 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...
Case #5: Input is 60 x 882 with 1% NaNs
------------------------------- With Proposed Approach
Elapsed time is 0.014144 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.324878 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...
Case #6: Input is 60 x 882 with 0.001% NaNs
------------------------------- With Proposed Approach
Elapsed time is 0.020410 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 1.830632 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...
What you are describing is the default behavior of corr without any special options. For example,
auxret = [8 2 3
3 5 NaN
7 10 3
7 4 6
2 6 7];
rho = corr(auxret)
Result:
rho =
1.0000 -0.1497 NaN
-0.1497 1.0000 NaN
NaN NaN NaN
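NumPy's corrcoef exhibits the same propagation (the NaN in column 3 turns that entire row and column of the result into NaN, while the NaN-free pairs are unaffected), which is a quick way to sanity-check the -0.1497 value above — a sketch, assuming numpy is available:

```python
import numpy as np

auxret = np.array([[8, 2, 3],
                   [3, 5, np.nan],
                   [7, 10, 3],
                   [7, 4, 6],
                   [2, 6, 7]], dtype=float)

# rowvar=False treats columns (not rows) as variables, like MATLAB's corr
rho = np.corrcoef(auxret, rowvar=False)
# Row/column 3 of rho are all NaN because column 3 contains a NaN;
# rho[0, 1] is approximately -0.1497, matching the MATLAB output.
```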