Assign mean of values if id is not unique to respective vector index

Question

I am looking for an efficient solution to the example below, in base R only.

# toy data/example
idx <- c(1,2,3,3,4)
vals <- c(3,6,7,1,5)
res <- rep(NA, length = 10)

res[idx] <- vals

# gives 
res[idx]
#> [1] 3 6 1 1 5

What I am aiming for is:

# desired output
res[idx] 
#> [1] 3 6 4 4 5

That is, if idx is not unique (as is the case for $idx=3$), I want to store the mean of $7$ and $1$, i.e. $4$, instead of $1$ [the last assigned value].

Note that in the real application $idx=3$ may occur more than twice. In addition, idx can contain thousands of non-unique indices/values.

Answer 1

You can use aggregate to get the mean for each idx.

. <- aggregate(vals ~ idx, FUN=mean)
res[.$idx] <- .$vals
res[idx]
#[1] 3 6 4 4 5

Or use tapply.

. <- tapply(vals, idx, mean)
res[as.integer(names(.))] <- .
res[idx]
#[1] 3 6 4 4 5
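
The question notes that an index may occur more than twice. As a quick check, here is a minimal sketch of the tapply approach on slightly modified toy data in which idx = 3 appears three times. The extra value and the name m are assumptions made for illustration; they are not part of the original example.

```r
# Hypothetical variation of the toy data: idx = 3 now occurs three times
idx  <- c(1, 2, 3, 3, 3, 4)
vals <- c(3, 6, 7, 1, 4, 5)
res  <- rep(NA_real_, 10)

# One mean per distinct idx; the names of m are the idx values as strings
m <- tapply(vals, idx, mean)

# Write each group mean to the position given by its idx value
res[as.integer(names(m))] <- m

res[idx]
#> [1] 3 6 4 4 4 5
```

The three values 7, 1 and 4 average to 4, so every position that refers to idx = 3 reads back 4, while the untouched positions of res stay NA.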

Answer 2

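If you are asking for a base R solution because of speed concerns, you may want to explore a tapply solution, e.g. res <- tapply(vals, idx, mean)[idx], in preference to the accepted aggregate solution above. (In the meantime, its author has added a tapply solution as well.)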

Benchmarks show that it is indeed faster:

idx <- c(1,2,3,3,4)
vals <- c(3,6,7,1,5)
res <- rep(NA, length = length(idx))

agg_fun <- function(res, vals, idx) { # By: GKi
  
  . <- aggregate(vals ~ idx, FUN=mean)
  res[.$idx] <- .$vals
  res[idx]
  
}

ave_fun <- function(res, vals, idx) { # By: Pax/MrFlick

  res <- ave(vals, idx, FUN = mean)
  res

}

apply_fun <- function(res, vals, idx) {
  
  res <- tapply(vals, idx, mean)[idx] |> as.vector()
  res
  
}

bench::mark(
  agg_fun(res, vals, idx),
  ave_fun(res, vals, idx),
  apply_fun(res, vals, idx)
)
  
# A tibble: 3 × 13
# expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   
# <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   
# agg_fun(res, vals, idx)    221.9µs  259.7µs     3710.        0B     2.43  1527     1      412ms <dbl [5]>
# ave_fun(res, vals, idx)       36µs   39.8µs    24262.        0B     2.45  9883     1      407ms <dbl [5]>
# apply_fun(res, vals, idx)   34.2µs   36.1µs    27200.        0B     5.47  9941     2      365ms <dbl [5]>
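
The ave variant benchmarked above returns one group mean per element of vals, so it can also be used to fill the preallocated res from the question directly. The following is a minimal sketch under that reading, not part of the benchmark; it relies on ave using mean as its default FUN.

```r
idx  <- c(1, 2, 3, 3, 4)
vals <- c(3, 6, 7, 1, 5)
res  <- rep(NA_real_, 10)

# ave() returns a vector as long as vals, holding the mean of each idx group.
# Duplicated indices receive the same value, so "last assignment wins" is harmless.
res[idx] <- ave(vals, idx)

res[idx]
#> [1] 3 6 4 4 5
```

Because the assignment is indexed by idx itself, this form does not depend on the idx values being consecutive.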

The tapply version is also faster on a larger sample:

idx2 <- sample(1:100, 100000, replace = TRUE)
vals2 <- sample(1:1000, 100000, replace = TRUE)
res2 <- rep(NA, length = length(idx2))

bench::mark(
  agg_fun(res2, vals2, idx2),
  ave_fun(res2, vals2, idx2),
  apply_fun(res2, vals2, idx2)
)

# A tibble: 3 × 13
# expression                        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
# <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
# agg_fun(res2, vals2, idx2)    28.37ms  28.89ms      34.6    26.5MB     2.47    14     1      404ms <dbl> 
# ave_fun(res2, vals2, idx2)     4.16ms   4.57ms     220.     5.98MB     2.53    87     1      396ms <dbl> 
# apply_fun(res2, vals2, idx2)   2.74ms   2.82ms     328.     6.36MB     2.58   127     1      387ms <dbl>
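
One caveat about the tapply(vals, idx, mean)[idx] one-liner: it looks up the means by position, which matches the idx values only when idx covers 1, 2, ..., k without gaps (as in the toy data, and almost certainly in the larger sample). The sketch below uses hypothetical data with a gap to show the difference and a name-based lookup as one possible workaround; the names m, by_position and by_name are made up for illustration.

```r
# Hypothetical data: idx skips the values 3 and 4
idx  <- c(1, 2, 5, 5)
vals <- c(3, 6, 7, 1)

m <- tapply(vals, idx, mean)   # three means, named "1", "2" and "5"

# Positional lookup: m has only three elements, so position 5 is out of range
by_position <- as.vector(m[idx])
by_position
#> [1]  3  6 NA NA

# Name-based lookup: select each mean by its group label instead
by_name <- as.vector(m[as.character(idx)])
by_name
#> [1] 3 6 4 4
```

The res[as.integer(names(.))] <- . form from Answer 1 is not affected, since it places each mean at the position given by its name.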
