如何按组在 R 中的数据框中执行 KS.test

Question

我想在 R 中针对下面的示例数据帧 "data" 执行两个样本的 Kolmogorov-Smirnov (KS) 测试：

    Protein1    Protein2    Protein3    Protein4    Protein5    Protein6    Protein7    Protein8    Group
Sample1 0.56    1.46    0.64    2.53    1.96    305.29  428.41  113.22  Control
Sample2 0.75    2.29    0.38    4.31    1.72    307.95  492.58  82.75   Control
Sample3 2.05    1.73    2.42    14.75   2.92    523.92  426.94  131.51  Control
Sample4 1.71    5.37    0.68    6.39    2.02    343.05  435.16  123.12  Control
Sample5 13.31   0.94    1.21    3.83    2.83    313.71  327.84  66.8    Control
Sample6 0.36    1.81    0.42    2.25    1.48    335.23  352.55  93.81   Control
Sample7 0.28    3.26    0.49    2.62    1.96    251.49  468.19  80.27   Control
Sample8 1.62    17.01   0.49    2.87    1.7 254.79  402.9   86.8    Control
Sample9 0.59    2.64    0.18    2.93    1.23    388.87  384.53  109.52  Control
Sample10    1.67    3.75    0.43    3.89    1.83    306.28  440.86  97.55   Control
Sample11    15.53   12.02   0.57    1.81    2.31    328.56  352.98  118.18  Control
Sample12    0.94    7.06    35.77   4.98    2.44    389.14  376.18  119.75  Control
Sample13    2.07    1.73    0.38    3.89    2.56    233.81  377.21  72.1    Control
Sample14    3.2 1.38    0.5 4.05    2.51    406.57  538.57  209.16  Patient
Sample15    1.33    2.17    0.46    3.31    1.72    278.04  276.37  79.4    Patient
Sample16    1.48    2.9 0.84    9.27    1.94    332.76  413.66  99.09   Patient
Sample17    2.02    1.96    0.25    4.16    1.96    358.73  383.63  107.46  Patient
Sample18    2.94    2   0.27    3.99    2.55    354.78  493.02  145.36  Patient
Sample19    1.01    8.1 0.35    3.65    1.62    335.18  264.74  145.15  Patient
Sample20    6.95    15.48   2.94    3.64    2.3 307.23  484.38  119.61  Patient
Sample21    0.52    1.38    0.56    3.08    1.86    244.3   304.74  76.87   Patient
Sample22    0.35    2.17    0.38    4.51    2.09    304.68  369.98  151.76  Patient
Sample23    2.26    2.9 0.3 4.44    2.43    302.51  367.51  150.69  Patient
Sample24    3.19    1.96    0.81    2.94    2.15    309.59  362.18  133.49  Patient
Sample25    1.12    2   0.71    3.77    2.42    334.36  358.9   131.35  Patient
Sample26    5.28    8.1 0.81    22.84   2.35    422.18  369.71  76.35   Patient

要在 2 个单独的列之间执行 KS 测试，这是代码：

> ks.test(data<span class="math-container">$Protein1, data$</span>Protein2, data=data)

    Two-sample Kolmogorov-Smirnov test

data:  data<span class="math-container">$Protein1 and data$</span>Protein2
D = 0.42308, p-value = 0.01905
alternative hypothesis: two-sided

Warning message:
In ks.test(data<span class="math-container">$Protein1, data$</span>Protein2, data = data) :
  cannot compute exact p-value with ties

但是，我想按列和按组执行此操作。例如，对于 t.test 或 wilcox.test 很容易做到这一点，因为您可以将其编码为 t.test(y1,y2) 或 t.test(y~x) # 其中 y 是数字，x 是二进制因子但是当涉及到二进制因子时，没有 ks.test 的代码。有人可以帮忙吗？

最终，我想对所有蛋白质的整个数据框执行此操作，就像我可以成功进行 t 检验一样，但我想对 ks.test 执行以下操作：

t_tests <- lapply(
  data[, -1], # apply the function to every variable *other than* the first one (group)
  function(x) { t.test(x ~ HealthGroups, data = data) }
)

感谢任何帮助。提前致谢。

Answer 1

这是一个非常简单的方法。这使用了一个循环，这在 R 圈子中通常是不受欢迎的。然而，它非常简单并且self-explanatory，这对于新用户来说是一个加分项，并且在这种情况下循环太慢也没有问题。（请注意，如果您愿意，可以使用 lapply()，但这仍然是一个循环，它只是在外面看起来不同。）

只需创建两个具有相同变量的新子集数据框。然后循环调用 ks.test 的数据帧。输出不是很 user-friendly——它只会说 j——所以我添加了对 ?writeLines 的调用以打印被测试变量的名称。

# I am assuming the original data frame is called d
dc <- d[d$Group=="Control",]
dp <- d[d$Group=="Patient",]
for(j in 1:8){  
  writeLines(names(dc)[j])
  print(ks.test(dc[,j], dp[,j]))  
}
# Protein1
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.30769, p-value = 0.5882
# alternative hypothesis: two-sided
# 
# Protein2
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein3
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein4
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.46154, p-value = 0.1254
# alternative hypothesis: two-sided
# 
# Protein5
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein6
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.15385, p-value = 0.9992
# alternative hypothesis: two-sided
# 
# Protein7
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.38462, p-value = 0.2999
# alternative hypothesis: two-sided
# 
# Protein8
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.46154, p-value = 0.1265
# alternative hypothesis: two-sided
# 
# Warning messages:
# 1: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 2: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 3: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 4: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties

这将创建一个数据框来存放结果：

dk <- data.frame(Protein=character(8), D=numeric(8), p=numeric(8), stringsAsFactors=F)
for(j in 1:8){  
  k <- ks.test(dc[,j], dp[,j])
  dk$Protein[j] <- names(dc)[j]
  dk$D[j]       <- k$statistic
  dk$p[j]       <- k$p.value
}
dk
#    Protein         D         p
# 1 Protein1 0.3076923 0.5881961
# 2 Protein2 0.2307692 0.8793244
# 3 Protein3 0.2307692 0.8793244
# 4 Protein4 0.4615385 0.1253895
# 5 Protein5 0.2307692 0.8793244
# 6 Protein6 0.1538462 0.9992124
# 7 Protein7 0.3846154 0.2999202
# 8 Protein8 0.4615385 0.1264877

如何按组在 R 中的数据框中执行 KS.test

How to perform KS.test in a dataframe in R by groups

r

kolmogorov-smirnov