How to remove rows and columns that are (almost) entirely made up of NaNs?
> matrix(c(c(0, 3.75882e-06, 3.71645e-05, 2.16088e-06, 1.357e-06, 1.19274e-06, NaN, 1.14748e-06, 9.3314e-07), c(3.75882e-06, 0, 3.94165e-05, 3.58464e-06, 3.60392e-06, 3.43881e-06, NaN, 3.39315e-06, 3.17616e-06), c(3.71645e-05, 3.94165e-05, 0, 3.78173e-05, 3.70121e-05, 3.68449e-05, NaN, 3.6798e-05, 3.65591e-05), c(2.16088e-06, 3.58464e-06, 3.78173e-05, 0, 2.00581e-06, 1.84085e-06, NaN, 1.79527e-06, 1.57976e-06), c(1.357e-06, 3.60392e-06, 3.70121e-05, 2.00581e-06, 0, 1.03709e-06, NaN, 9.91615e-07, 7.77135e-07), c(1.19274e-06, 3.43881e-06, 3.68449e-05, 1.84085e-06, 1.03709e-06, 0, NaN, 8.27333e-07, 6.12979e-07), c(NaN, NaN, NaN, NaN, NaN, NaN, 0, NaN, NaN), c(1.14748e-06, 3.39315e-06, 3.6798e-05, 1.79527e-06, 9.91615e-07, 8.27333e-07, NaN, 0, 5.67856e-07), c(9.3314e-07, 3.17616e-06, 3.65591e-05, 1.57976e-06, 7.77135e-07, 6.12979e-07, NaN, 5.67856e-07, 0)), ncol=9)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0.00000e+00 3.75882e-06 3.71645e-05 2.16088e-06 1.35700e-06 1.19274e-06 NaN 1.14748e-06 9.33140e-07
[2,] 3.75882e-06 0.00000e+00 3.94165e-05 3.58464e-06 3.60392e-06 3.43881e-06 NaN 3.39315e-06 3.17616e-06
[3,] 3.71645e-05 3.94165e-05 0.00000e+00 3.78173e-05 3.70121e-05 3.68449e-05 NaN 3.67980e-05 3.65591e-05
[4,] 2.16088e-06 3.58464e-06 3.78173e-05 0.00000e+00 2.00581e-06 1.84085e-06 NaN 1.79527e-06 1.57976e-06
[5,] 1.35700e-06 3.60392e-06 3.70121e-05 2.00581e-06 0.00000e+00 1.03709e-06 NaN 9.91615e-07 7.77135e-07
[6,] 1.19274e-06 3.43881e-06 3.68449e-05 1.84085e-06 1.03709e-06 0.00000e+00 NaN 8.27333e-07 6.12979e-07
[7,] NaN NaN NaN NaN NaN NaN 0 NaN NaN
[8,] 1.14748e-06 3.39315e-06 3.67980e-05 1.79527e-06 9.91615e-07 8.27333e-07 NaN 0.00000e+00 5.67856e-07
[9,] 9.33140e-07 3.17616e-06 3.65591e-05 1.57976e-06 7.77135e-07 6.12979e-07 NaN 5.67856e-07 0.00000e+00
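I have a bunch of matrices of the kind shown above. They are filled with numeric elements, except for certain rows and columns that consist of NaNs. At the intersection of a NaN row and a NaN column there is always a zero. Note that in the example above only one row and one column contain NaNs, but in practice I may have several such rows and columns.
My goal is to write a function that automatically removes the rows and columns that consist almost entirely of NaNs. How can I do that?
Logical indexing with rowSums and colSums (each in the right position) gives a very compact and efficient answer, assuming the matrix is stored as M:
M[rowSums(is.na(M)) < 0.8*nrow(M), ][ , colSums(is.na(M)) < 0.8*ncol(M)]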
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000e+00 3.75882e-06 3.71645e-05 2.16088e-06 1.35700e-06
[2,] 3.75882e-06 0.00000e+00 3.94165e-05 3.58464e-06 3.60392e-06
[3,] 3.71645e-05 3.94165e-05 0.00000e+00 3.78173e-05 3.70121e-05
[4,] 2.16088e-06 3.58464e-06 3.78173e-05 0.00000e+00 2.00581e-06
[5,] 1.35700e-06 3.60392e-06 3.70121e-05 2.00581e-06 0.00000e+00
[6,] 1.19274e-06 3.43881e-06 3.68449e-05 1.84085e-06 1.03709e-06
[7,] 1.14748e-06 3.39315e-06 3.67980e-05 1.79527e-06 9.91615e-07
[8,] 9.33140e-07 3.17616e-06 3.65591e-05 1.57976e-06 7.77135e-07
[,6] [,7] [,8]
[1,] 1.19274e-06 1.14748e-06 9.33140e-07
[2,] 3.43881e-06 3.39315e-06 3.17616e-06
[3,] 3.68449e-05 3.67980e-05 3.65591e-05
[4,] 1.84085e-06 1.79527e-06 1.57976e-06
[5,] 1.03709e-06 9.91615e-07 7.77135e-07
[6,] 0.00000e+00 8.27333e-07 6.12979e-07
[7,] 8.27333e-07 0.00000e+00 5.67856e-07
[8,] 6.12979e-07 5.67856e-07 0.00000e+00
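To see why the comparison works, it may help to look at the NaN counts on their own. The following is only a sketch on a made-up 6x6 matrix `m` (not the matrix from the question); the names `row_nan`, `col_nan`, `keep_rows`, `keep_cols` and `small` are illustrative.

```r
# Hypothetical 6x6 example (not the matrix from the question):
# row 3 and column 3 are NaN except for the 0 where they cross.
m <- matrix(1:36 / 10, nrow = 6)
m[3, ] <- NaN
m[, 3] <- NaN
m[3, 3] <- 0

# is.na() is TRUE for NaN as well as NA, so these are NaN counts.
row_nan <- rowSums(is.na(m))
col_nan <- colSums(is.na(m))
print(row_nan)  # [1] 1 1 5 1 1 1
print(col_nan)  # [1] 1 1 5 1 1 1

# Keep only rows/columns whose NaN count is below 80% of the size.
keep_rows <- row_nan < 0.8 * nrow(m)
keep_cols <- col_nan < 0.8 * ncol(m)
small <- m[keep_rows, keep_cols]
print(dim(small))  # [1] 5 5
```

Every ordinary row still contains one NaN (from the NaN column), which is why the test compares against a fraction of the size rather than checking for any NaN at all. Note that with a cutoff of 0.8 and fewer than five rows, a row with a single non-NaN entry is at most 75% NaN and would not be dropped, so the cutoff may need adjusting for very small matrices.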
It can even be done in one step:
M[rowSums(is.na(M)) < 0.8*nrow(M), colSums(is.na(M)) < 0.8*ncol(M)]
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000e+00 3.75882e-06 3.71645e-05 2.16088e-06 1.35700e-06
[2,] 3.75882e-06 0.00000e+00 3.94165e-05 3.58464e-06 3.60392e-06
[3,] 3.71645e-05 3.94165e-05 0.00000e+00 3.78173e-05 3.70121e-05
[4,] 2.16088e-06 3.58464e-06 3.78173e-05 0.00000e+00 2.00581e-06
[5,] 1.35700e-06 3.60392e-06 3.70121e-05 2.00581e-06 0.00000e+00
[6,] 1.19274e-06 3.43881e-06 3.68449e-05 1.84085e-06 1.03709e-06
[7,] 1.14748e-06 3.39315e-06 3.67980e-05 1.79527e-06 9.91615e-07
[8,] 9.33140e-07 3.17616e-06 3.65591e-05 1.57976e-06 7.77135e-07
[,6] [,7] [,8]
[1,] 1.19274e-06 1.14748e-06 9.33140e-07
[2,] 3.43881e-06 3.39315e-06 3.17616e-06
[3,] 3.68449e-05 3.67980e-05 3.65591e-05
[4,] 1.84085e-06 1.79527e-06 1.57976e-06
[5,] 1.03709e-06 9.91615e-07 7.77135e-07
[6,] 0.00000e+00 8.27333e-07 6.12979e-07
[7,] 8.27333e-07 0.00000e+00 5.67856e-07
[8,] 6.12979e-07 5.67856e-07 0.00000e+00
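If you are sure that the NaN count in such a row or column is exactly one less than its length (only the zero at the intersection is not NaN), the logical tests can be made exact instead: < (ncol(M)-1) for the rows and < (nrow(M)-1) for the columns. For a square matrix the two bounds are the same. The comparison has to be strict, because <= would also keep a row that has exactly that many NaNs.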
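Since the question asks for a function, the one-step indexing above could be wrapped roughly as follows. This is a minimal sketch, not a definitive implementation: the name `drop_mostly_nan` and the `threshold` argument are hypothetical, and it uses `rowMeans`/`colMeans` (the share of NaNs per row or column) in place of the count comparison, which gives the same result for square matrices and also behaves sensibly for non-square ones.

```r
# Hypothetical helper; the name and `threshold` argument are illustrative.
drop_mostly_nan <- function(M, threshold = 0.8) {
  stopifnot(is.matrix(M))
  nan_mask <- is.na(M)                         # TRUE for NaN (and NA)
  keep_rows <- rowMeans(nan_mask) < threshold  # share of NaNs in each row
  keep_cols <- colMeans(nan_mask) < threshold  # share of NaNs in each column
  # drop = FALSE keeps the result a matrix even if one row/column remains.
  M[keep_rows, keep_cols, drop = FALSE]
}

# Made-up 7x7 example with two NaN rows and two NaN columns.
m <- matrix(as.numeric(1:49), nrow = 7)
m[c(2, 5), ] <- NaN
m[, c(2, 5)] <- NaN
m[2, 2] <- 0
m[5, 5] <- 0

cleaned <- drop_mostly_nan(m)
print(dim(cleaned))    # [1] 5 5
print(anyNA(cleaned))  # [1] FALSE
```

To process a whole list of such matrices, something like `lapply(list_of_matrices, drop_mostly_nan)` should work, where `list_of_matrices` is again just an assumed name.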