Wrong product produced in pairwise error estimation in R

I want to compute a pairwise error from a melted matrix that looks like this:

pw.data = data.frame(true_tree = rep(c("maple","oak","pine"),3), 
                 guess_tree = c(rep("maple",3),rep("oak",3),rep("pine",3)),
                 value = c(12,0,1,1,15,0,2,1,14))


true_tree guess_tree value
  maple      maple    12
    oak      maple     0
   pine      maple     1
  maple        oak     1
    oak        oak    15
   pine        oak     0
  maple       pine     2
    oak       pine     1
   pine       pine    14

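So I want to estimate the pairwise error between the true and the guessed tree species. For this estimate, the formula should be: pairwise misassignments / number of all estimates for the two selected species.

To explain further: the wrong guesses for maple and oak (the maple-oak and oak-maple comparisons) are 1 + 0 = 1, and the number of all guesses is 12 + 1 + 2 (all counts for true_tree == "maple") + 0 + 15 + 1 (all counts for true_tree == "oak") = 31. So the estimate is 1/31.

When I check one specific case, say maple and oak again, I can compute the estimate manually like this: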

sum(pw.data[((pw.data[,1] == "maple" & pw.data[,2] == "oak") | 
      (pw.data[,1] == "oak" & pw.data[,2] == "maple")) &
      (pw.data[,1] != pw.data[,2]),3]) / 
 (sum(pw.data[pw.data[,1] == "maple",3]) + sum(pw.data[pw.data[,1] == "oak",3]))

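However, I want to do this for much larger data, so I would like to create a for loop or function that computes the estimates and stores the results in a data frame, for example: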

Pw_tree   value
Maple-oak 0.0123
....

I tried to use that logic in the for loop shown below, but it does not work at all.

for (i in pw.data[,1]) { 
for (j in pw.data[,2]) {
x = sum( pw.data[((pw.data[,1] == i & pw.data[,2] == j ) | 
                (pw.data[,1] == j & pw.data[,2] == i)) &
               (pw.data[,1] != pw.data[,2]),3])  
y = (sum(pw.data[pw.data[,1] == i,3]) + sum(pw.data[pw.data[,1] == j,3]))
   PWerr_data = data.frame( pw_tree = paste(i,j, sep = "-"), value = x/y)
 }

}

It would be great if I could see what I am doing wrong. Thanks a lot!

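I usually solve this kind of problem by building the function I want to apply (you were almost there), then building whatever data structure makes it most convenient to apply, and then using one of the apply family of functions to walk over that data structure and collect the results. This lets me avoid for-loop constructs, which is nice because I'm the kind of programmer who always messes up the indexing in a double for loop.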

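In your case, we can wrap your ratio of sums in a function that takes a data.frame and two tree names as arguments. Then we only need to create the set of pairs we want to use. A handy function for this is combn(), which returns all combinations of size m from the elements of x: this gives us the non-redundant set of pairs we need.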
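To make the combn() step concrete, here is a minimal sketch of what it returns for three species. The vector name `trees` is only illustrative and is not part of the original code.

```r
# Illustrative vector of species names (hypothetical name `trees`)
trees <- c("maple", "oak", "pine")

# combn(x, m) returns a matrix with one combination per COLUMN
pairs <- combn(trees, 2)

ncol(pairs)
#> [1] 3
pairs[, 2]
#> [1] "maple" "pine"
```

Because each pair sits in a column, iterating with apply(..., 2, ...) visits every unordered pair exactly once.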

Annotated example code below:

# Load your data
pw.data = data.frame(true_tree = rep(c("maple","oak","pine"),3), 
                     guess_tree = c(rep("maple",3),rep("oak",3),rep("pine",3)),
                     value = c(12,0,1,1,15,0,2,1,14))
pw.data
#>   true_tree guess_tree value
#> 1     maple      maple    12
#> 2       oak      maple     0
#> 3      pine      maple     1
#> 4     maple        oak     1
#> 5       oak        oak    15
#> 6      pine        oak     0
#> 7     maple       pine     2
#> 8       oak       pine     1
#> 9      pine       pine    14

# build the function we will repeatedly apply
getErr <- function(t1, t2, data=pw.data) {
  # compute the rate as you wrote it, indexing only the `data` argument
  rate <- sum(data[((data[,1] == t1 & data[,2] == t2) |
               (data[,1] == t2 & data[,2] == t1)) &
              (data[,1] != data[,2]),3]) /
  (sum(data[data[,1] == t1,3]) + sum(data[data[,1] == t2,3]))

  # output the items involved as a named list (useful for later)
  list(Pw_tree = paste(t1, t2, sep='-'), error_rate = rate)
}

# test it
getErr("maple", "oak")
#> $Pw_tree
#> [1] "maple-oak"
#> 
#> $error_rate
#> [1] 0.03225806
# Good: this matches the 1/31 you worked out by hand

# build the data structure we will run the function across
all.trees <- unique(c(as.character(pw.data$true_tree), as.character(pw.data$guess_tree)))
all.name.combos <- combn(all.trees, 2)

# we will use the do.call(rbind, ls) trick: apply() over the columns
# (MARGIN = 2) yields a list of results, which rbind stacks into a matrix
error_rates_df <- do.call(rbind, apply(all.name.combos, 2, function(pair){getErr(pair[1], pair[2])}))
error_rates_df
#>      Pw_tree      error_rate
#> [1,] "maple-oak"  0.03225806
#> [2,] "maple-pine" 0.1       
#> [3,] "oak-pine"   0.03225806
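Note that the result above prints with `[1,]`-style row labels, so it is a matrix rather than a true data.frame. If a regular data.frame is wanted, one possible sketch is below. The helper `pair_error()` and the object name `error_rates` are hypothetical names introduced here, and the helper returns only the numeric rate instead of a list.

```r
# Same example data as above, kept as character columns
pw.data <- data.frame(
  true_tree  = rep(c("maple", "oak", "pine"), 3),
  guess_tree = rep(c("maple", "oak", "pine"), each = 3),
  value      = c(12, 0, 1, 1, 15, 0, 2, 1, 14),
  stringsAsFactors = FALSE
)

# Hypothetical helper: misassignments between t1 and t2, divided by
# all counts whose true species is t1 or t2
pair_error <- function(t1, t2, data) {
  swapped <- (data$true_tree == t1 & data$guess_tree == t2) |
             (data$true_tree == t2 & data$guess_tree == t1)
  total   <- data$true_tree %in% c(t1, t2)
  sum(data$value[swapped]) / sum(data$value[total])
}

pairs <- combn(unique(pw.data$true_tree), 2)

# Build the data.frame column by column; vapply guarantees a numeric vector
error_rates <- data.frame(
  Pw_tree    = paste(pairs[1, ], pairs[2, ], sep = "-"),
  error_rate = vapply(seq_len(ncol(pairs)),
                      function(k) pair_error(pairs[1, k], pairs[2, k], pw.data),
                      numeric(1))
)
# error_rate holds 1/31, 3/30 and 1/31, the same rates as in the matrix above
```

With a real data.frame, the usual tools such as `error_rates[order(error_rates$error_rate), ]` work without further conversion.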

Created on 2018-10-30 by the reprex package (v0.2.1)
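For completeness, here is one way the double loop from the question could be repaired. As written, that loop iterates over every row value of the two columns (so each species is visited several times, including paired with itself), and it overwrites PWerr_data on every pass, so only the last pair survives. The sketch below loops over unique species and collects the rows in a list instead. The names `species` and `results` are illustrative.

```r
pw.data <- data.frame(
  true_tree  = rep(c("maple", "oak", "pine"), 3),
  guess_tree = rep(c("maple", "oak", "pine"), each = 3),
  value      = c(12, 0, 1, 1, 15, 0, 2, 1, 14),
  stringsAsFactors = FALSE
)

species <- unique(pw.data$true_tree)  # each species once
results <- list()                     # one small data.frame per pair

for (a in seq_along(species)) {
  for (b in seq_along(species)) {
    if (a >= b) next                  # skip self-pairs and mirrored pairs
    i <- species[a]
    j <- species[b]
    # misassignments between i and j, in either direction
    x <- sum(pw.data$value[(pw.data$true_tree == i & pw.data$guess_tree == j) |
                           (pw.data$true_tree == j & pw.data$guess_tree == i)])
    # all counts whose true species is i or j
    y <- sum(pw.data$value[pw.data$true_tree %in% c(i, j)])
    # append instead of overwriting
    results[[length(results) + 1]] <- data.frame(pw_tree = paste(i, j, sep = "-"),
                                                 value = x / y)
  }
}

PWerr_data <- do.call(rbind, results)
```

This yields three rows (maple-oak, maple-pine, oak-pine), the same pairs that the combn() approach produces.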