Running Rounding

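I am trying to round a column in such a way that the running sum of the rounded values matches the running sum of the raw values within a group.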

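The sample data for the task has three columns: numbers (the values to round), ids (the order of the values), and group (the group within which the rounding is done).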

Here is a data sample, already sorted by ID within each group:

       numbers  ids group
       35.07209 1   1
       27.50931 2   1
       70.62019 3   1
       99.55451 6   1
       34.40472 8   1
       17.58864 10  1
       93.66178 4   3
       83.21700 5   3
       63.89058 7   3
       88.96561 9   3

To generate sample data for testing, I use the following code:

  # Make data sample.
  x.size <- 10^6
  x <- list("numbers" = runif(x.size) * 100, "ids" = 1:x.size, "group" = ifelse(runif(x.size) > 0.2 ,1, ifelse(runif(x.size) > 0.8, 2, 3)))
  x<- data.frame(x)
  x <- x[order(x$group), ]

I wrote a function that keeps the rounding state within a group, to make sure the total of the rounded values is correct:

makeRunRound <- function() {
  # Data must be sorted by id.
  cumDiff <- 0
  savedId <- 0

  function(x, id) {
  # id here represents the group.

    if(id != savedId) {
      cumDiff <<- 0
      savedId <<- id
    }

    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff

    if(cumDiff > 1) {
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    return (xInt)
  }
}

runRound <- makeRunRound()
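
To make the carried state concrete, here is a minimal sketch that runs the closure above on the six group 1 values from the data sample. The variable names `numbers`, `group`, and `rounded` are only illustrative.

```r
# Closure from the question: it carries the accumulated fraction between calls.
makeRunRound <- function() {
  cumDiff <- 0
  savedId <- 0
  function(x, id) {
    if (id != savedId) {   # new group: reset the carried fraction
      cumDiff <<- 0
      savedId <<- id
    }
    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff
    if (cumDiff > 1) {     # carried fraction passed 1: add it to this value
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    xInt
  }
}
runRound <- makeRunRound()

# Group 1 rows of the data sample, in id order.
numbers <- c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864)
group <- rep(1, length(numbers))

rounded <- mapply(FUN = runRound, numbers, group)
print(rounded)
# 35 27 71 99 35 17

# The running sum of the rounded values stays within 1 of the raw running sum.
print(cumsum(rounded) - cumsum(numbers))
```

Because the closure mutates `cumDiff` on every call, it has to be invoked one value at a time and in order, which is what makes it hard to vectorise.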

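This approach works, and I would be happy with it if it were not for the speed.

It takes 2-3 seconds to complete the running rounding on a sample of 1m records.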

This is too slow for me, and there is another approach which works six times faster. I keep the code as given in the answer by josliber:

smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x-y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}

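Using the sample data generated by the code above, I ran a benchmark:

# Code to benchmark speed.
library(microbenchmark)
library(data.table)
# Assumed: x.dt is the data.table version of x (its definition is not shown in the post).
x.dt <- as.data.table(x)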
res <- microbenchmark(
  "run.df" = x$mrounded <- mapply(FUN=runRound, x$numbers, x$group),
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  "smart.df" = x$smart.round <- smartRound(x$numbers),
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "silly" = x$silly.round <- round(x$numbers),
  times = 50
)
print(res)
boxplot(res)

which produces these results:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
   run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679    50
   run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113    50
 smart.df  488.70854  537.03179  576.57704  567.63077  611.81271  861.76436    50
 smart.dt  390.35646  414.96749  468.95317  457.85820  507.54395  631.17081    50
    silly   13.72486   15.82744   19.41796   17.19057   18.85385   88.06329    50

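So the time goes from about 20 milliseconds for cell-level rounding to about 2.6 seconds for the method that respects the running total of the rounded values within a group.

I have included a comparison of the data.frame-based and data.table-based calculations to show that there is no major difference, although data.table improves performance slightly.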

I really appreciate the simplicity and speed of smartRound, but it does not respect the order of the items, so the result differs from what I need.
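
A small made-up example shows the ordering problem. The input vector below is hypothetical, chosen so that the small fractions come first. smartRound preserves the overall total, but it hands out the extra units to the largest fractional parts wherever they sit, so the running sum can drift by more than 1.

```r
# smartRound as given in the question (from the answer by josliber).
smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}

# Hypothetical input: five values of 0.4 followed by five values of 0.6.
x <- c(rep(0.4, 5), rep(0.6, 5))

rounded <- smartRound(x)
print(rounded)
# 0 0 0 0 0 1 1 1 1 1

# The total matches, but the running sums drift apart in the middle.
print(sum(rounded))
drift <- abs(cumsum(rounded) - cumsum(x))
print(max(drift))
```

Here the total is 5 on both sides, yet after the fifth element the raw running sum is already 2 while the rounded running sum is still 0.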

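Is there a way to get the same order-respecting results as runRound without losing the performance of smartRound?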

EDIT:

The answer by dww gives the fastest solution:

diffRound <- function(x) { 
  diff(c(0, round(cumsum(x)))) 
}
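
To see what diffRound does step by step, here is a short sketch on the six group 1 values from the data sample. The intermediate names `running` and `running.rounded` are only illustrative.

```r
diffRound <- function(x) {
  diff(c(0, round(cumsum(x))))
}

# Group 1 rows of the data sample, in id order.
numbers <- c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864)

running <- cumsum(numbers)          # raw running sum
running.rounded <- round(running)   # 35 63 133 233 267 285
rounded <- diffRound(numbers)       # increments between rounded running sums
print(rounded)
# 35 28 70 100 34 18

# The rounded values add back up to the rounded running sum,
# so each running total is within 0.5 of the raw one.
print(all(cumsum(rounded) == running.rounded))
print(max(abs(cumsum(rounded) - running)))
```

Since each output is the difference of two consecutive rounded running sums, the sum of the first k outputs telescopes back to the rounded running sum at position k.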

I have reduced the test to four options:

res <- microbenchmark(
  "silly" = x$silly.round <- round(x$numbers),
  "diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)] ,
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  times = 50
)

New results:

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval
     silly   14.67823   16.64882   17.31416   16.83338   17.67497   22.48689    50
 diff(dww)   54.57762   70.11553   76.67135   71.37325   76.83717  139.18745    50
  smart.dt  392.83240  408.65768  456.46592  441.33212  492.67824  592.57723    50
    run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005    50

Thanks to dww, I got a 6x performance gain without losing precision.

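I would do it this way, using simple base vectorized functions:

First calculate the running sum of the original numbers, and the rounded value of that running sum. Then find a list of numbers that add up to this rounded running total, using diff() to see how much larger each rounded sum is than the previous one.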

cum.sum <- cumsum(x$numbers)
cum.sum.rounded <- round(cum.sum)
numbers.round <- diff(cum.sum.rounded)
numbers.round <- c(cum.sum.rounded[1], numbers.round)

Check that everything is as you want it:

check.cs <- cumsum(numbers.round)
all( abs(check.cs - cum.sum) <=1 )
#TRUE
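
The code above treats the whole column as one sequence. If the running totals have to restart in each group, as in the question, one possible base R sketch is to apply the same steps per group with `ave()`. The small data frame below is the ten-row sample from the question, and the helper name `diffRound` is borrowed from the question's edit.

```r
# Ten-row sample from the question, sorted by id within group.
x <- data.frame(
  numbers = c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864,
              93.66178, 83.21700, 63.89058, 88.96561),
  ids     = c(1, 2, 3, 6, 8, 10, 4, 5, 7, 9),
  group   = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# Same idea as above: differences of the rounded running sum.
diffRound <- function(v) diff(c(0, round(cumsum(v))))

# ave() applies the function within each group and keeps the row order.
x$rounded <- ave(x$numbers, x$group, FUN = diffRound)
print(x$rounded)
# 35 28 70 100 34 18 94 83 64 89

# Per-group check: rounded running sums stay within 0.5 of the raw ones.
dev <- ave(x$rounded, x$group, FUN = cumsum) - ave(x$numbers, x$group, FUN = cumsum)
print(all(abs(dev) <= 0.5))
```

This assumes the rows are already sorted in the order the running total should follow within each group.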