R: Monte Carlo procedure using a permute or sample function to generate a null distribution

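In this dataset, my cluster analysis assigned every patient sample (69 rows in total) to a cluster, recorded in column 3, "Cluster.assigned". There are 8 clusters of unequal size. The other columns contain variables, and I want to test the numeric ones (e.g. age) to see whether any of them is enriched in a cluster more than expected by chance.

I have hit a wall because of my limited coding ability. My idea is to treat the real data as the observed values, then shuffle the cluster labels with a sample or permutation function, as in a Monte Carlo simulation, say 1000 times, and treat the resulting simulated distribution as the expected values.

Taking the Age column as an example: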

# minimal dummy 30-row data
Patient.ID <-c("S3077497","S1041120","S162465","S563275","S2911623","S3117192","S2859024","S2088278","S3306185","S190789","S12146451","S2170842","S115594","S2024203","S1063872","S2914138","S303984","S570813","S2176683","S820460","S1235729","S3009401","S2590229","S629309","S120256","S2572773","S3180483","S3032079","S3217608","S5566943")

Cluster.assigned <- c("cluster1","cluster1","cluster1","cluster1","cluster1","cluster1","cluster1","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster2","cluster3","cluster3","cluster3","cluster3","cluster3","cluster3","cluster3","cluster4","cluster4","cluster4")

Age <- c(61,80,78,69,57,70,60,59,72,82,66,68,70,62,82,80,67,77,74,77,74,74,64,70,74,64,54,73,58,87)

# data.frame() keeps Age numeric; cbind() would coerce every column to character
CLL_3S <- data.frame(Patient.ID, Cluster.assigned, Age)

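To see whether any cluster is enriched for patients in a particular age range, the null hypothesis is that the age distribution does not differ between clusters. So I should shuffle the patient labels, or shuffle the age data, say 1000 times. That should give me a simulated data frame from which I can compute the mean and standard deviation of the simulated (expected) values.
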
# I imagine using shuffle() to permute 1000 times
# and combining the simulated results into one large data frame
 library(permute)  # provides shuffle()
 shuffled <- numeric(length = 1000)
 N <- nrow(CLL_3S)

 set.seed(123)
  for (i in seq_len(length(shuffled) - 1)) {
      perm <- shuffle(N)
      .........

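The next step is to take the actual observed ages of the patients in each cluster and compute the enrichment as a Z-score, i.e. (observed value - expected mean) / SD.

Once this process is automated, I can apply it to other columns of interest and to other datasets with a different number of clusters. I have read a bit about sample() and shuffle(), but it did not really help me solve this particular problem...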

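I am not sure whether the code below meets your goal. If I understand your question correctly, what you should do is shuffle only the cluster assignments, then add a new z-score column computed within each cluster label.

  • sample performs the random shuffle
  • scale computes the z-score
  • ave applies the z-score computation within each cluster label
  • replicate runs the simulation multiple times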
replicate(1000,
  within(
    transform(CLL_3S,
      Cluster.assigned = Cluster.assigned[sample(1:nrow(CLL_3S))]
    ),
    zscore <- ave(Age, Cluster.assigned, FUN = scale)
  ),
  simplify = FALSE
)
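
A quick way to check what this call returns is sketched below. It assumes CLL_3S is the data frame listed at the end of the answer, rebuilt here in compact form (without Patient.ID) so the snippet runs on its own. `sims` is only an illustrative name, 5 replicates are used instead of 1000 to keep the output small, and `sample(Cluster.assigned)` is used as an equivalent shorthand for indexing with a shuffled row order.

```r
# Rebuild the example data compactly (same cluster labels and ages as in the post)
CLL_3S <- data.frame(
  Cluster.assigned = rep(paste0("cluster", 1:4), times = c(7, 13, 7, 3)),
  Age = c(61, 80, 78, 69, 57, 70, 60, 59, 72, 82, 66, 68, 70, 62, 82,
          80, 67, 77, 74, 77, 74, 74, 64, 70, 74, 64, 54, 73, 58, 87)
)

set.seed(123)  # arbitrary seed, only for reproducibility
sims <- replicate(5,
  within(
    # shuffle the labels; the ages stay in place
    transform(CLL_3S, Cluster.assigned = sample(Cluster.assigned)),
    # z-score of Age within each (shuffled) cluster
    zscore <- ave(Age, Cluster.assigned, FUN = scale)
  ),
  simplify = FALSE
)

length(sims)                       # one data frame per replicate
table(sims[[1]]$Cluster.assigned)  # cluster sizes are preserved by the shuffle
head(sims[[1]])                    # shuffled labels plus the zscore column
```

Each list element is a full copy of the data with permuted labels, so the cluster sizes never change; only the assignment of ages to clusters does.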

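Update

If you only want the per-cluster mean and variance averaged over the 1000 simulations (with the SD taken as the square root of the averaged variance), you can try the code below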

n <- 1000
res <- Reduce(
  `+`,
  replicate(n,
    with(
      CLL_3S,
      do.call(rbind, tapply(Age, Cluster.assigned[sample(1:nrow(CLL_3S))], FUN = function(x) c(Mean = mean(x), Var = var(x))))
    ),
    simplify = FALSE
  )
) / n
res <- within(as.data.frame(res), SD <- sqrt(Var))

This gives

> res
             Mean      Var       SD
cluster1 70.21086 68.99152 8.306114
cluster2 70.06915 71.93188 8.481267
cluster3 70.03571 70.19276 8.378112
cluster4 70.12500 68.98867 8.305942
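
To get from here to the Z-score described in the question, (observed value - expected mean) / SD, one option is to keep every permuted per-cluster mean rather than only their average, and to use the spread of those permuted means as the SD. The sketch below makes two assumptions that are not stated in the post: the statistic of interest is the per-cluster mean, and "SD" means the standard deviation of that mean across permutations (which is not the same quantity as the SD column in res, the square root of the averaged within-cluster variance). `perm_zscore`, `value`, `group` and `n_perm` are illustrative names, and the data are rebuilt compactly so the snippet runs on its own.

```r
# Rebuild the example data compactly (same cluster labels and ages as in the post)
CLL_3S <- data.frame(
  Cluster.assigned = rep(paste0("cluster", 1:4), times = c(7, 13, 7, 3)),
  Age = c(61, 80, 78, 69, 57, 70, 60, 59, 72, 82, 66, 68, 70, 62, 82,
          80, 67, 77, 74, 77, 74, 74, 64, 70, 74, 64, 54, 73, 58, 87)
)

# Hypothetical helper: permutation Z-score of the per-cluster mean
perm_zscore <- function(value, group, n_perm = 1000) {
  # observed mean of `value` in each cluster
  obs <- tapply(value, group, mean)
  # matrix with one row per cluster and one column per shuffle of the labels
  null_means <- replicate(n_perm, tapply(value, sample(group), mean))
  expected <- rowMeans(null_means)          # expected mean under the null
  null_sd  <- apply(null_means, 1, sd)      # spread of the permuted means
  data.frame(
    Observed = as.numeric(obs),
    Expected = as.numeric(expected),
    SD       = as.numeric(null_sd),
    Z        = (as.numeric(obs) - as.numeric(expected)) / as.numeric(null_sd),
    row.names = names(obs)
  )
}

set.seed(123)  # arbitrary seed, only for reproducibility
z_age <- perm_zscore(CLL_3S$Age, CLL_3S$Cluster.assigned)
z_age
```

Because the helper takes the column and the labels as arguments, the same call should carry over to other numeric columns and to datasets with a different number of clusters.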

Data

> dput(CLL_3S)
structure(list(Patient.ID = c("S3077497", "S1041120", "S162465", 
"S563275", "S2911623", "S3117192", "S2859024", "S2088278", "S3306185",
"S190789", "S12146451", "S2170842", "S115594", "S2024203", "S1063872",
"S2914138", "S303984", "S570813", "S2176683", "S820460", "S1235729",
"S3009401", "S2590229", "S629309", "S120256", "S2572773", "S3180483",
"S3032079", "S3217608", "S5566943"), Cluster.assigned = c("cluster1",
"cluster1", "cluster1", "cluster1", "cluster1", "cluster1", "cluster1", 
"cluster2", "cluster2", "cluster2", "cluster2", "cluster2", "cluster2",
"cluster2", "cluster2", "cluster2", "cluster2", "cluster2", "cluster2",
"cluster2", "cluster3", "cluster3", "cluster3", "cluster3", "cluster3",
"cluster3", "cluster3", "cluster4", "cluster4", "cluster4"), 
    Age = c(61, 80, 78, 69, 57, 70, 60, 59, 72, 82, 66, 68, 70,
    62, 82, 80, 67, 77, 74, 77, 74, 74, 64, 70, 74, 64, 54, 73,
    58, 87)), class = "data.frame", row.names = c(NA, -30L))