如何运行 Wilcox 测试 R 中大量特征的所有组组合？

Question

我有一个很大的稀疏矩阵（我们称之为 matrix），其中行是特征，列是样本。每个 column/sample 属于 6 个组中的 1 个。我从每组中随机抽取一些数量，并将它们所属的索引存储在原始矩阵中。

astro_index <- Map(sample,row_index, num_sample)[1] 
endo_index <- Map(sample,row_index, num_sample)[2] 
micro_index <- Map(sample,row_index, num_sample)[3]
neuron_index <- Map(sample,row_index, num_sample)[4] 
oligo_index <- Map(sample,row_index, num_sample)[5] 
opc_index <- Map(sample,row_index, num_sample)[6]

目标是能够执行 Wilcox 检验并获得每个特征的 6 组所有组合的 p 值。最大的问题是我有超过 30,000 个功能要测试 6 组的所有组合（因此对 30,000 多个功能中的每一个进行 15 次比较）。

所以我目前有两种方法。 first 一个使用 apply 函数并且只进行一次比较（这里是 astro 和神经元组）。这种方法的缺点是我运行遇到内存问题并且它一次只进行 1 次比较。我将不得不再写 14 次才能得到所有可能的比较。

store_p <- apply(matrix,1,function(x) {wilcox.test(x[astro_index],x[neuron_index])$p.value })

second 方法使用 for 循环遍历所有特征，但我利用 combn 和数据框来计算除一个组合之外的所有组合的 p 值一次功能。这个方法确实很慢但是不会崩溃

for (i in features){

  df <- data.frame('Astro' = matrix[i,astro_index], 'Endo' = matrix[i,endo_index], 'Micro' = matrix[i,micro_index], 'Neuron' = matrix[i,neuron_index], 'Oligo' = matrix[i,oligo_index], 'OPC' = matrix[i,opc_index])
  result <- combn(names(df), 2, FUN = function(x) paste(paste(x, collapse='-'), wilcox.test(df[,x[1]], df[,x[2]], paired = TRUE)$p.value, sep=" : ")) 
  hold_list <- append(hold_list, list(result))

}

让大家了解 result 的样子。这是 result

的示例输出

> result
 [1] "Astro-Endo : 0.115331575924872"      "Astro-Micro : 0.935664046257304"     "Astro-Neuron : 0.0271849565394441"  
 [4] "Astro-Oligo : 0.00147694402781699"   "Astro-OPC : 0.0476580762532988"      "Endo-Micro : 0.297672151508384"     
 [7] "Endo-Neuron : 2.38134038927696e-06"  "Endo-Oligo : 0.0323129112432441"     "Endo-OPC : 0.451258974150342"       
[10] "Micro-Neuron : 0.000143621746738224" "Micro-Oligo : 0.0178171887595787"    "Micro-OPC : 0.0692129715131915"     
[13] "Neuron-Oligo : 6.68255453156116e-10" "Neuron-OPC : 6.201108273594e-07"     "Oligo-OPC : 0.142213241936393"

理想情况下，我会喜欢两种方法中最好的两个世界，并做一个更有效的过程来计算这些测试。另外，如果有人建议一起设计一个不同的数据框架来以一种方式解决这个任务，我也会很感激。

编辑我意识到我没有说清楚，但 result 仅适用于所有组合的一个特征。我有一个 for 循环，以便它遍历所有功能。本质上，应该为所有特征和所有组合计算一个 p 值。

Answer 1

如果我理解正确，这个问题可以归结为计算一组特征的相似性度量。我会使用 proxy 库来解决它，尤其是 proxy::dist 函数。它允许使用自定义相似性度量。

我生成了一个合成数据集来说明这个过程：

library(proxy)
set.seed(123)

feature_df <- data.frame(matrix(runif(n = 1000^2), 1000, 1000))
names(feature_df) <- paste0("feature_", names(feature_df))

feature_set_1 <- sample(x = 1:1000, 10)
feature_set_2 <- sample(x = 1:1000, 10)
feature_set_3 <- sample(x = 1:1000, 10)

接下来，我们定义我们的函数，运行它在所有特征集上成对：

test_fun <- function(x,y) {wilcox.test(x,y)$p.value}

fs <- c(feature_set_1, feature_set_2, feature_set_3)  
k <- 1
dist_mat <- vector(mode="list")
for (i in seq_along(fs)){
  for (j in 1:i){
    if (i>j) {
      dist_mat[[k]] <- proxy::dist(feature_df[fs[[i]], ], feature_df[fs[[j]], ], method = test_fun, by_rows = TRUE)
      k <- k+1
    }
  }
}

dist_mat 矩阵包含感兴趣的数量。 i 和 j 指的是特征集。 by_rows 用于指示 dist 按行计算指标。

Answer 2

为此，我会使用 scran 中的 pairwiseWilcox - 这似乎非常适合您的问题。它对列组之间的每一行执行成对的 Wilcoxon 秩和检验，其中组是列分配的向量。

编辑：

抽样组具有相同数量的元素（列），因为 OP 似乎想要这样。
减少矩阵的稀疏性，使其更清楚地表明它不比较单个值，而是比较每行的值组。

示例：

library(Matrix)
types <- c("Astro", "Neuron", "Endo", "Oligo", "OPC", "Micro")

# generate sparse matrix
set.seed(123)
mat <- Matrix(0, nrow = 10000, ncol = 1000, sparse = TRUE)
mat[sample(seq_along(mat), 1E5)] <- runif(n = 1e5, min = 0, max=100)
groups <- c(rep(types, each = floor(ncol(mat)/6)), rep("Micro", ncol(mat) %% 6))
colnames(mat) <- make.unique(groups)

# sample n=100 samples of each group
idx <- setNames(lapply(types, function(x) grep(x, colnames(mat))), types)
smp <- Map(sample, idx, size = 100)
groups <- gsub("[0-9]+", "", names(unlist(smp)))

# subset mat to sampled columns
mat <- mat[, unlist(smp, use.names = FALSE)]

library(scran)

pwt <- pairwiseWilcox(mat, groups = groups)
pwt
#> $statistics
#> $statistics[[1]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49995  1.000000  1.000000
#> 2       0.51000  0.158341  0.616668
#> 3       0.49000  0.158341  0.616668
#> 4       0.50490  0.573540  0.856541
#> 5       0.48985  0.308851  0.616668
#> ...         ...       ...       ...
#> 9996     0.4950  0.565662  0.856541
#> 9997     0.5050  0.322174  0.616668
#> 9998     0.4951  0.573540  0.856541
#> 9999     0.4950  0.322174  0.616668
#> 10000    0.5050  0.322174  0.616668
#> 
#> $statistics[[2]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5050 0.3221741  0.613045
#> 2        0.5049 0.5735395  0.858464
#> 3        0.4800 0.0444225  0.613045
#> 4        0.4947 0.6352736  0.948311
#> 5        0.4949 0.5578376  0.858464
#> ...         ...       ...       ...
#> 9996    0.49500  0.565662  0.858464
#> 9997    0.50005  1.000000  1.000000
#> 9998    0.50500  0.322174  0.613045
#> 9999    0.50000  1.000000  1.000000
#> 10000   0.50500  0.322174  0.613045
#> 
#> $statistics[[3]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5050  0.322174  0.605697
#> 2        0.5001  0.995980  1.000000
#> 3        0.5000  1.000000  1.000000
#> 4        0.5100  0.158341  0.605697
#> 5        0.4949  0.557838  0.854499
#> ...         ...       ...       ...
#> 9996    0.50005  1.000000  1.000000
#> 9997    0.49995  1.000000  1.000000
#> 9998    0.50005  1.000000  1.000000
#> 9999    0.49500  0.322174  0.605697
#> 10000   0.49995  1.000000  1.000000
#> 
#> $statistics[[4]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49995  1.000000  1.000000
#> 2       0.50010  0.995980  1.000000
#> 3       0.50000  1.000000  1.000000
#> 4       0.49490  0.648212  0.959177
#> 5       0.50500  0.322174  0.615026
#> ...         ...       ...       ...
#> 9996     0.4949  0.557838  0.859750
#> 9997     0.4951  0.573540  0.859750
#> 9998     0.4852  0.182661  0.615026
#> 9999     0.5000  1.000000  1.000000
#> 10000    0.4949  0.557838  0.859750
#> 
#> $statistics[[5]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50500  0.322174  0.620334
#> 2       0.49480  0.641729  0.964426
#> 3       0.50000  1.000000  1.000000
#> 4       0.51000  0.158341  0.620334
#> 5       0.49995  1.000000  1.000000
#> ...         ...       ...       ...
#> 9996    0.50005  1.000000  1.000000
#> 9997    0.49015  0.323442  0.620334
#> 9998    0.50005  1.000000  1.000000
#> 9999    0.49500  0.322174  0.620334
#> 10000   0.50500  0.322174  0.620334
#> 
#> $statistics[[6]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50005  1.000000  1.000000
#> 2       0.49000  0.158341  0.616668
#> 3       0.51000  0.158341  0.616668
#> 4       0.49510  0.573540  0.856541
#> 5       0.51015  0.308851  0.616668
#> ...         ...       ...       ...
#> 9996     0.5050  0.565662  0.856541
#> 9997     0.4950  0.322174  0.616668
#> 9998     0.5049  0.573540  0.856541
#> 9999     0.5050  0.322174  0.616668
#> 10000    0.4950  0.322174  0.616668
#> 
#> $statistics[[7]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50500  0.322174  0.616668
#> 2       0.49500  0.322174  0.616668
#> 3       0.48960  0.392127  0.746909
#> 4       0.49005  0.318530  0.616668
#> 5       0.50500  0.654721  0.960283
#> ...         ...       ...       ...
#> 9996     0.5001  0.995980  1.000000
#> 9997     0.4950  0.322174  0.616668
#> 9998     0.5100  0.158341  0.616668
#> 9999     0.5050  0.322174  0.616668
#> 10000    0.5000  1.000000  1.000000
#> 
#> $statistics[[8]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1         0.505  0.322174  0.604226
#> 2         0.490  0.158341  0.604226
#> 3         0.510  0.158341  0.604226
#> 4         0.505  0.322174  0.604226
#> 5         0.505  0.654721  0.952044
#> ...         ...       ...       ...
#> 9996    0.50510  0.557838  0.849437
#> 9997    0.49500  0.322174  0.604226
#> 9998    0.50500  0.565662  0.849437
#> 9999    0.49995  1.000000  1.000000
#> 10000   0.49500  0.322174  0.604226
#> 
#> $statistics[[9]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49995  1.000000  1.000000
#> 2       0.49000  0.158341  0.611076
#> 3       0.51000  0.158341  0.611076
#> 4       0.49005  0.318530  0.611076
#> 5       0.51500  0.082748  0.611076
#> ...         ...       ...       ...
#> 9996     0.5000  1.000000  1.000000
#> 9997     0.4900  0.158341  0.611076
#> 9998     0.4899  0.405995  0.762863
#> 9999     0.5050  0.322174  0.611076
#> 10000    0.4900  0.158341  0.611076
#> 
#> $statistics[[10]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50500  0.322174  0.619147
#> 2       0.48500  0.082748  0.619147
#> 3       0.51000  0.158341  0.619147
#> 4       0.50500  0.322174  0.619147
#> 5       0.50985  0.323442  0.619147
#> ...         ...       ...       ...
#> 9996    0.50500  0.565662  0.863244
#> 9997    0.48500  0.082748  0.619147
#> 9998    0.50510  0.557838  0.863244
#> 9999    0.50005  1.000000  1.000000
#> 10000   0.50000  1.000000  1.000000
#> 
#> $statistics[[11]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.4950 0.3221741  0.613045
#> 2        0.4951 0.5735395  0.858464
#> 3        0.5200 0.0444225  0.613045
#> 4        0.5053 0.6352736  0.948311
#> 5        0.5051 0.5578376  0.858464
#> ...         ...       ...       ...
#> 9996    0.50500  0.565662  0.858464
#> 9997    0.49995  1.000000  1.000000
#> 9998    0.49500  0.322174  0.613045
#> 9999    0.50000  1.000000  1.000000
#> 10000   0.49500  0.322174  0.613045
#> 
#> $statistics[[12]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49500  0.322174  0.616668
#> 2       0.50500  0.322174  0.616668
#> 3       0.51040  0.392127  0.746909
#> 4       0.50995  0.318530  0.616668
#> 5       0.49500  0.654721  0.960283
#> ...         ...       ...       ...
#> 9996     0.4999  0.995980  1.000000
#> 9997     0.5050  0.322174  0.616668
#> 9998     0.4900  0.158341  0.616668
#> 9999     0.4950  0.322174  0.616668
#> 10000    0.5000  1.000000  1.000000
#> 
#> $statistics[[13]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5000 1.0000000  1.000000
#> 2        0.4951 0.5735395  0.868341
#> 3        0.5200 0.0444225  0.625009
#> 4        0.5150 0.0827480  0.625009
#> 5        0.5000 1.0000000  1.000000
#> ...         ...       ...       ...
#> 9996    0.50500  0.565662  0.868341
#> 9997    0.49995  1.000000  1.000000
#> 9998    0.49500  0.322174  0.625009
#> 9999    0.49500  0.322174  0.625009
#> 10000   0.49500  0.322174  0.625009
#> 
#> $statistics[[14]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49500 0.3221741  0.606038
#> 2       0.49510 0.5735395  0.852213
#> 3       0.52000 0.0444225  0.606038
#> 4       0.50005 1.0000000  1.000000
#> 5       0.51000 0.1583409  0.606038
#> ...         ...       ...       ...
#> 9996     0.4998 0.9879417  1.000000
#> 9997     0.4951 0.5735395  0.852213
#> 9998     0.4800 0.0444225  0.606038
#> 9999     0.5000 1.0000000  1.000000
#> 10000    0.4900 0.1583409  0.606038
#> 
#> $statistics[[15]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50000 1.0000000  1.000000
#> 2       0.49005 0.3185296  0.619978
#> 3       0.52000 0.0444225  0.619978
#> 4       0.51500 0.0827480  0.619978
#> 5       0.50500 0.5656624  0.863114
#> ...         ...       ...       ...
#> 9996    0.50500  0.565662  0.863114
#> 9997    0.49015  0.323442  0.619978
#> 9998    0.49500  0.322174  0.619978
#> 9999    0.49500  0.322174  0.619978
#> 10000   0.50000  1.000000  1.000000
#> 
#> $statistics[[16]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.4950  0.322174  0.605697
#> 2        0.4999  0.995980  1.000000
#> 3        0.5000  1.000000  1.000000
#> 4        0.4900  0.158341  0.605697
#> 5        0.5051  0.557838  0.854499
#> ...         ...       ...       ...
#> 9996    0.49995  1.000000  1.000000
#> 9997    0.50005  1.000000  1.000000
#> 9998    0.49995  1.000000  1.000000
#> 9999    0.50500  0.322174  0.605697
#> 10000   0.50005  1.000000  1.000000
#> 
#> $statistics[[17]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1         0.495  0.322174  0.604226
#> 2         0.510  0.158341  0.604226
#> 3         0.490  0.158341  0.604226
#> 4         0.495  0.322174  0.604226
#> 5         0.495  0.654721  0.952044
#> ...         ...       ...       ...
#> 9996    0.49490  0.557838  0.849437
#> 9997    0.50500  0.322174  0.604226
#> 9998    0.49500  0.565662  0.849437
#> 9999    0.50005  1.000000  1.000000
#> 10000   0.50500  0.322174  0.604226
#> 
#> $statistics[[18]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5000 1.0000000  1.000000
#> 2        0.5049 0.5735395  0.868341
#> 3        0.4800 0.0444225  0.625009
#> 4        0.4850 0.0827480  0.625009
#> 5        0.5000 1.0000000  1.000000
#> ...         ...       ...       ...
#> 9996    0.49500  0.565662  0.868341
#> 9997    0.50005  1.000000  1.000000
#> 9998    0.50500  0.322174  0.625009
#> 9999    0.50500  0.322174  0.625009
#> 10000   0.50500  0.322174  0.625009
#> 
#> $statistics[[19]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.4950  0.322174  0.619978
#> 2        0.4999  0.995980  1.000000
#> 3        0.5000  1.000000  1.000000
#> 4        0.4850  0.082748  0.619978
#> 5        0.5100  0.158341  0.619978
#> ...         ...       ...       ...
#> 9996     0.4949  0.557838  0.863504
#> 9997     0.4951  0.573540  0.863504
#> 9998     0.4850  0.176800  0.619978
#> 9999     0.5050  0.322174  0.619978
#> 10000    0.4949  0.557838  0.863504
#> 
#> $statistics[[20]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5000  1.000000  1.000000
#> 2        0.4947  0.635274  0.958759
#> 3        0.5000  1.000000  1.000000
#> 4        0.5000  1.000000  1.000000
#> 5        0.5050  0.565662  0.869131
#> ...         ...       ...       ...
#> 9996    0.49995  1.000000  1.000000
#> 9997    0.49015  0.323442  0.625372
#> 9998    0.50005  1.000000  1.000000
#> 9999    0.50005  1.000000  1.000000
#> 10000   0.50500  0.322174  0.625372
#> 
#> $statistics[[21]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50005  1.000000  1.000000
#> 2       0.49990  0.995980  1.000000
#> 3       0.50000  1.000000  1.000000
#> 4       0.50510  0.648212  0.959177
#> 5       0.49500  0.322174  0.615026
#> ...         ...       ...       ...
#> 9996     0.5051  0.557838  0.859750
#> 9997     0.5049  0.573540  0.859750
#> 9998     0.5148  0.182661  0.615026
#> 9999     0.5000  1.000000  1.000000
#> 10000    0.5051  0.557838  0.859750
#> 
#> $statistics[[22]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50005  1.000000  1.000000
#> 2       0.51000  0.158341  0.611076
#> 3       0.49000  0.158341  0.611076
#> 4       0.50995  0.318530  0.611076
#> 5       0.48500  0.082748  0.611076
#> ...         ...       ...       ...
#> 9996     0.5000  1.000000  1.000000
#> 9997     0.5100  0.158341  0.611076
#> 9998     0.5101  0.405995  0.762863
#> 9999     0.4950  0.322174  0.611076
#> 10000    0.5100  0.158341  0.611076
#> 
#> $statistics[[23]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50500 0.3221741  0.606038
#> 2       0.50490 0.5735395  0.852213
#> 3       0.48000 0.0444225  0.606038
#> 4       0.49995 1.0000000  1.000000
#> 5       0.49000 0.1583409  0.606038
#> ...         ...       ...       ...
#> 9996     0.5002 0.9879417  1.000000
#> 9997     0.5049 0.5735395  0.852213
#> 9998     0.5200 0.0444225  0.606038
#> 9999     0.5000 1.0000000  1.000000
#> 10000    0.5100 0.1583409  0.606038
#> 
#> $statistics[[24]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5050  0.322174  0.619978
#> 2        0.5001  0.995980  1.000000
#> 3        0.5000  1.000000  1.000000
#> 4        0.5150  0.082748  0.619978
#> 5        0.4900  0.158341  0.619978
#> ...         ...       ...       ...
#> 9996     0.5051  0.557838  0.863504
#> 9997     0.5049  0.573540  0.863504
#> 9998     0.5150  0.176800  0.619978
#> 9999     0.4950  0.322174  0.619978
#> 10000    0.5051  0.557838  0.863504
#> 
#> $statistics[[25]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5050  0.322174  0.618555
#> 2        0.4947  0.635274  0.954724
#> 3        0.5000  1.000000  1.000000
#> 4        0.5150  0.082748  0.618555
#> 5        0.4950  0.322174  0.618555
#> ...         ...       ...       ...
#> 9996     0.5051  0.557838  0.864937
#> 9997     0.4948  0.641729  0.961248
#> 9998     0.5152  0.171079  0.618555
#> 9999     0.4950  0.322174  0.618555
#> 10000    0.5100  0.158341  0.618555
#> 
#> $statistics[[26]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49500  0.322174  0.620334
#> 2       0.50520  0.641729  0.964426
#> 3       0.50000  1.000000  1.000000
#> 4       0.49000  0.158341  0.620334
#> 5       0.50005  1.000000  1.000000
#> ...         ...       ...       ...
#> 9996    0.49995  1.000000  1.000000
#> 9997    0.50985  0.323442  0.620334
#> 9998    0.49995  1.000000  1.000000
#> 9999    0.50500  0.322174  0.620334
#> 10000   0.49500  0.322174  0.620334
#> 
#> $statistics[[27]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.49500  0.322174  0.619147
#> 2       0.51500  0.082748  0.619147
#> 3       0.49000  0.158341  0.619147
#> 4       0.49500  0.322174  0.619147
#> 5       0.49015  0.323442  0.619147
#> ...         ...       ...       ...
#> 9996    0.49500  0.565662  0.863244
#> 9997    0.51500  0.082748  0.619147
#> 9998    0.49490  0.557838  0.863244
#> 9999    0.49995  1.000000  1.000000
#> 10000   0.50000  1.000000  1.000000
#> 
#> $statistics[[28]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1       0.50000 1.0000000  1.000000
#> 2       0.50995 0.3185296  0.619978
#> 3       0.48000 0.0444225  0.619978
#> 4       0.48500 0.0827480  0.619978
#> 5       0.49500 0.5656624  0.863114
#> ...         ...       ...       ...
#> 9996    0.49500  0.565662  0.863114
#> 9997    0.50985  0.323442  0.619978
#> 9998    0.50500  0.322174  0.619978
#> 9999    0.50500  0.322174  0.619978
#> 10000   0.50000  1.000000  1.000000
#> 
#> $statistics[[29]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.5000  1.000000  1.000000
#> 2        0.5053  0.635274  0.958759
#> 3        0.5000  1.000000  1.000000
#> 4        0.5000  1.000000  1.000000
#> 5        0.4950  0.565662  0.869131
#> ...         ...       ...       ...
#> 9996    0.50005  1.000000  1.000000
#> 9997    0.50985  0.323442  0.625372
#> 9998    0.49995  1.000000  1.000000
#> 9999    0.49995  1.000000  1.000000
#> 10000   0.49500  0.322174  0.625372
#> 
#> $statistics[[30]]
#> DataFrame with 10000 rows and 3 columns
#>             AUC   p.value       FDR
#>       <numeric> <numeric> <numeric>
#> 1        0.4950  0.322174  0.618555
#> 2        0.5053  0.635274  0.954724
#> 3        0.5000  1.000000  1.000000
#> 4        0.4850  0.082748  0.618555
#> 5        0.5050  0.322174  0.618555
#> ...         ...       ...       ...
#> 9996     0.4949  0.557838  0.864937
#> 9997     0.5052  0.641729  0.961248
#> 9998     0.4848  0.171079  0.618555
#> 9999     0.5050  0.322174  0.618555
#> 10000    0.4900  0.158341  0.618555
#> 
#> 
#> $pairs
#> DataFrame with 30 rows and 2 columns
#>           first      second
#>     <character> <character>
#> 1         Astro        Endo
#> 2         Astro       Micro
#> 3         Astro      Neuron
#> 4         Astro       Oligo
#> 5         Astro         OPC
#> ...         ...         ...
#> 26          OPC       Astro
#> 27          OPC        Endo
#> 28          OPC       Micro
#> 29          OPC      Neuron
#> 30          OPC       Oligo

^{由 reprex package (v0.3.0)}

于 2020-06-18 创建

编辑#2：

看看我的方法会给你带来什么（至少在我的机器上）StupidWolf 方法的 1/20,000，用他的例子 mat 和 [=15 试试这个=]:

set.seed(111)
celltypes = c("astro","endo","micro","neuron","oligo","opc")
mat = matrix(rnorm(10000*120),ncol=120)
colnames(mat) = paste0("cell",1:120)
rownames(mat) = paste0("gene",1:10000)
metadata = data.frame(celltype=rep(celltypes,each=20))
num_sample = 10
use_cols = tapply(1:nrow(metadata),metadata$celltype,sample,num_sample)
use_cols = unlist(use_cols)
group = metadata$celltype[use_cols]

library(scran)
library(data.table)
pwt <- pairwiseWilcox(mat[,use_cols], groups=group)
unique_comps <- !duplicated(t(apply(pwt$pairs, 1, sort)))
res <- rbindlist(setNames(lapply(pwt$statistics[unique_comps], 
                                 function(x) as.data.table(x, keep.rownames=TRUE)), 
                          apply(pwt$pairs[unique_comps,], 1, paste, collapse = '_')),
                 idcol = "comparison")[, .(comparison, rn, p.value)]

setnames(res, "rn", "gene")
res[gene=="gene999"]
#>       comparison    gene   p.value
#>  1:   astro_endo gene999 0.1858767
#>  2:  astro_micro gene999 0.5707504
#>  3: astro_neuron gene999 0.3846731
#>  4:  astro_oligo gene999 0.4273553
#>  5:    astro_opc gene999 0.3846731
#>  6:   endo_micro gene999 0.4726756
#>  7:  endo_neuron gene999 0.9097219
#>  8:   endo_oligo gene999 0.6231762
#>  9:     endo_opc gene999 0.6775850
#> 10: micro_neuron gene999 0.9097219
#> 11:  micro_oligo gene999 0.9097219
#> 12:    micro_opc gene999 0.9097219
#> 13: neuron_oligo gene999 0.8501067
#> 14:   neuron_opc gene999 0.6775850
#> 15:    oligo_opc gene999 0.9698500

^{由 reprex package (v0.3.0)}

于 2020-06-19 创建

Answer 3

对于不熟悉此类数据的人来说，您的代码很可能没有意义。您很可能有一个基因表达矩阵，并且对于每一行（基因），您想要执行成对的 wilcox 检验。您从元数据中采样，因此 row_index.

很可能不是最好的统计方法，但这是组织数据的一种方式。首先设置一些数据：

set.seed(111)

celltypes = c("astro","endo","micro","neuron","oligo","opc")
mat = matrix(rnorm(10000*120),ncol=120)
colnames(mat) = paste0("cell",1:120)
rownames(mat) = paste0("gene",1:10000)
metadata = data.frame(celltype=rep(celltypes,each=20))

我们可以通过以下方式进行高效采样：

num_sample = 10
use_cols = tapply(1:nrow(metadata),metadata$celltype,sample,num_sample)
use_cols = unlist(use_cols)
group = metadata$celltype[use_cols]

下面你需要决定 1. 是否使用 FDR 校正（你可以在之后做），以及你是否想要这么大的 data.frame:

library(broom)
allpairs = lapply(1:nrow(mat),function(i){
result = tidy(pairwise.wilcox.test(x=mat[i,use_cols],group,
p.adjust.method ="none"))
result$gene = rownames(mat)[i]
result
})
allpairs = do.call(rbind,allpairs)

上面比较费时间，不过应该运行可以，也可以考虑并行化。结果如下所示：

# A tibble: 6 x 4
  group1 group2 p.value gene 
  <chr>  <chr>    <dbl> <chr>
1 endo   astro    0.853 gene1
2 micro  astro    0.393 gene1
3 neuron astro    0.971 gene1
4 oligo  astro    0.739 gene1
5 opc    astro    0.631 gene1
6 micro  endo     0.247 gene1

allpairs[allpairs$gene=="gene999",]
# A tibble: 15 x 4
   group1 group2 p.value gene   
   <chr>  <chr>    <dbl> <chr>  
 1 endo   astro    0.529 gene999
 2 micro  astro    0.481 gene999
 3 neuron astro    1     gene999
 4 oligo  astro    0.481 gene999
 5 opc    astro    0.912 gene999
 6 micro  endo     1     gene999
 7 neuron endo     0.436 gene999
 8 oligo  endo     0.684 gene999
 9 opc    endo     0.684 gene999
10 neuron micro    0.529 gene999
11 oligo  micro    0.684 gene999
12 opc    micro    0.739 gene999
13 oligo  neuron   0.481 gene999
14 opc    neuron   0.796 gene999
15 opc    oligo    0.853 gene999

如何运行 Wilcox 测试 R 中大量特征的所有组组合？

How to run Wilcox test on all combination of groups across large number of features in R?

statistics

optimization

r

memory-efficient

dataframe

如何 运行 Wilcox 测试 R 中大量特征的所有组组合？

How to run Wilcox test on all combination of groups across large number of features in R?

statistics

optimization

r

memory-efficient

dataframe

如何运行 Wilcox 测试 R 中大量特征的所有组组合？