Efficiently change individual elements in matrix/array in R

Question

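I am running a simulation in R and am trying to make it more efficient.

Some background: this is an abstract simulation to test the effects of mutation on a population. The population has N individuals, and each individual has a genotype of M letters, each of which can be one of the twenty amino acids (which I denote as 0:19).

One of the most (computationally) expensive tasks involves taking a matrix "mat" with M rows and N columns, which starts out as a matrix of all zeros,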

mat <- matrix(rep(0,M*N),nrow=M)

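and then changing (mutating) at least one letter in each individual's genotype. I say "at least" because, ideally, I would like to set a mutation rate (mutrate) so that, if I set it to 2 in my overall simulation function, every individual in the matrix gets 2 mutations.

I have found two rather computationally expensive ways to do this. As you can see below, only the second method incorporates the mutation rate parameter mutrate (I could not easily think of how to incorporate it into the first).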

#method 1
for(i in 1:N){
  position <- floor(runif(N, min=0, max=M))
  letter <- floor(runif(N, min=0, max=19))
  mat[position[i],i] = letter[i]
}
#method 2, somewhat faster and incorporates mutation rate
mat <- apply(mat,2,function(x) (x+sample(c(rep(0,M-mutrate),sample(0:19,size=mutrate))%%20)))

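The second method includes the modulus because, as I mentioned, genotype values must be between 0 and 19.

A few additional notes for clarity:

I do not strictly need every individual to get exactly the same number of mutations. That said, the distribution should be narrow enough that, with mutrate = 2, most individuals get two mutations, some get one, and some maybe three. I do not, however, want one individual to get a huge number of mutations while many get none.
It is worth noting that some mutations will change a letter into the same letter, so for large population sizes N the expected average number of mutations is slightly lower than the specified mutrate.
I believe the answer has to do with using the square-bracket subsetting method to pick one random element from each column of the matrix mat. However, I could not find any information on the syntax for isolating one random element from every column of a matrix. mat[sample(1:M),sample(1:N)] obviously gives you the whole matrix... perhaps I am missing something very obvious here.

Any help is much appreciated!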

Answer 1

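To answer your last question first: you can access a single cell in a matrix with mat[row,column], or several scattered cells at once by their sequential cell IDs. Cell 1,1 is the first cell, followed by 2,1, 3,1, and so on: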

mat <- matrix(rep(0, 5*5), nrow=5)
mat[c(1,3,5,7,9)] = c(1,2,3,4,5)

mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    4    0    0    0
[3,]    2    0    0    0    0
[4,]    0    5    0    0    0
[5,]    3    0    0    0    0
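
The numbering follows from R storing matrices column by column: the cell in row r and column c of a matrix with nrow rows has ID r + (c - 1) * nrow. As a small sketch to check this (the variable names are only illustrative), base R's `arrayInd()` converts cell IDs back into row/column pairs:

```r
# Illustrative 5x5 matrix, as in the example above
mat <- matrix(0, nrow = 5, ncol = 5)

ids <- c(1, 3, 5, 7, 9)

# arrayInd() converts linear cell IDs back to (row, column) pairs
rc <- arrayInd(ids, dim(mat))

# Recompute the IDs from rows and columns: id = row + (col - 1) * nrow
ids_back <- rc[, 1] + (rc[, 2] - 1) * nrow(mat)
```

Here `rc` holds the pairs (1,1), (3,1), (5,1), (2,2), (4,2), which are exactly the cells filled in the printed matrix.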

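Accessing/overwriting individual cells is also fast. The fastest way I can think of to do the task is to first create vectors for the values we want: a vector of all column indices (each column repeated mutrate times), a vector of row indices (random), and a vector of new values for these column/row combinations (random).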

cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed

for(i in seq_len(N*mutrate)) {
  mat[rows[i],cols[i]] = values[i]
}

Instead of using a for-loop to update the matrix, we can also compute the cell IDs, which lets us update all the matrix cells in one go:

cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
cellid = rows + (cols-1)*M
  
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
  
mat[cellid] = values
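
As an alternative to computing cell IDs by hand, base R also accepts a two-column matrix of (row, column) pairs as an index. A minimal sketch, assuming small illustrative values for M, N, mutrate and genotypes rather than the real simulation sizes:

```r
set.seed(1)  # only to make the sketch reproducible

M <- 10; N <- 4                 # illustrative sizes, not the real ones
genotypes <- 20; mutrate <- 2

mat <- matrix(0, nrow = M, ncol = N)

cols <- rep(seq_len(N), mutrate)
rows <- sample(M, N * mutrate, replace = TRUE)
values <- sample(genotypes, N * mutrate, replace = TRUE) - 1

# Each row of the index matrix is one (row, column) pair
idx <- cbind(rows, cols)
mat[idx] <- values

# Same update via explicit cell IDs, for comparison
mat2 <- matrix(0, nrow = M, ncol = N)
mat2[rows + (cols - 1) * M] <- values
```

Both forms write the same cells, so which one to use is mostly a matter of readability.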

Benchmarking the different methods on a 10000x6000 matrix (M = 10000 rows, N = 6000 columns) shows how fast each one is:

N = 6000  # individuals
M = 10000 # genotype length

genotypes = 20
mutrate = 2

method1 <- function() {
  
  mat <- matrix(rep(0,M*N),nrow=M)
  
  for(i in 1:(N*mutrate)){
    position <- sample(M, 1)
    letter <- sample(genotypes, 1) - 1
    mat[position, (i - 1) %% N + 1] = letter  # cycle through columns 1..N (i %% N would be 0 when i is a multiple of N)
  }
  
  return(mat)
  
}

method2 <- function() {
  
  mat <- matrix(rep(0,M*N),nrow=M)
  mat <- apply(mat,2,function(x) (x+sample(c(rep(0,M-mutrate),sample(0:19,size=mutrate))%%20)))
  
}

method3 <- function() {
  
  mat <- matrix(rep(0,M*N),nrow=M)
  
  cols = rep(seq_len(N), mutrate)
  rows = sample(M, N*mutrate, replace = T)
  values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
  
  for(i in seq_len(N*mutrate)) {
    mat[rows[i],cols[i]] = values[i]
  }
  
  return(mat)
  
}

method4 <- function() {
  
  mat <- matrix(rep(0,M*N),nrow=M)
  
  cols = rep(seq_len(N), mutrate)
  rows = sample(M, N*mutrate, replace = T)
  cellid = rows + (cols-1)*M
  
  values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
  
  mat[cellid] = values
  
  return(mat)
  
}

benchmark <- function(func, times=10) {
  begin <- as.numeric(Sys.time())
  for(i in seq_len(times))
    retval <- eval(parse(text=func))
  end <- as.numeric(Sys.time())
  cat(func, 'took', (end-begin)/times, 'seconds\n')
  return(retval)
}

ret1 <- benchmark('method1()')
ret2 <- benchmark('method2()')
ret3 <- benchmark('method3()')
ret4 <- benchmark('method4()')

I have modified your first method to speed it up and to apply mutrate.

method1() took 0.8936087 seconds
method2() took 8.767686 seconds
method3() took 0.7008878 seconds
method4() took 0.6548331 seconds
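
One caveat with the vectorised versions: because the rows are drawn with `replace = T`, the same individual can occasionally be hit twice at the same position, so it ends up with fewer than mutrate changed positions. If that matters, one possible variant draws distinct rows for each column. This is only a sketch with small illustrative sizes; it is not part of the benchmark above and is likely slower, since it calls `sample.int()` once per individual:

```r
set.seed(42)  # only to make the sketch reproducible

M <- 10; N <- 6                 # illustrative sizes, not the real ones
genotypes <- 20; mutrate <- 2

mat <- matrix(0, nrow = M, ncol = N)

# For each individual, draw mutrate distinct positions (no replacement)
rows <- unlist(lapply(seq_len(N), function(j) sample.int(M, mutrate)))
cols <- rep(seq_len(N), each = mutrate)  # matches the order of rows
values <- sample(genotypes, N * mutrate, replace = TRUE) - 1

cellid <- rows + (cols - 1) * M
mat[cellid] <- values
```

Note the use of `each = mutrate` here, so that consecutive entries of `cols` line up with the rows drawn for the same individual.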
