How to efficiently do complex row operations with nested functions in R?

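Given a multidimensional array, e.g. a zoo object z with columns a, b, c, x. Further given a function W(w=c(1,1,1), x) that, for example, weights each column individually but also depends on a row-specific value in column x. How can row operations be done efficiently here, e.g. computing rowWeightedMeans?

It is well known that R::zoo is very fast and efficient for row operations if the function is very simple, e.g.: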

W <- function(w) { return(w); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3)))

But what if W() depends on a value in that row? For example:

W <- function(w, x) { return(w*x); }
z[,"wmean"] <- rowWeightedMeans(z[,1:3], w=W(c(0.1,0.5,0.3), z[,4]))

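R complains here because it does not know how to handle the multiple dimensions of the arguments in the nested function.

One solution would be a for(i in 1:nrow(z)) loop that computes the value for each row i individually. For large data sets, however, this takes a lot of extra computational effort and time.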
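For weights of the simple form w * x, one loop-free option is to build the whole matrix of per-row weights first. The following is only a minimal base R sketch, assuming W(w, x) = w * x as in the example above; the toy data and the names `wmat`, `vals`, `wmean_vec` and `wmean_loop` are made up for illustration and do not come from the question.

```r
# Toy data with the same layout as in the question: columns a, b, c, x
z <- matrix(1:16, nrow = 4, dimnames = list(NULL, c("a", "b", "c", "x")))
w <- c(0.1, 0.5, 0.3)

# Row i of wmat holds the weights for row i, i.e. w * z[i, "x"]
wmat <- outer(z[, "x"], w)
vals <- z[, 1:3]

# Vectorized row-wise weighted mean: no explicit loop over rows
wmean_vec <- rowSums(vals * wmat) / rowSums(wmat)

# Reference implementation with an explicit loop over the rows
wmean_loop <- vapply(seq_len(nrow(z)), function(i) {
  wi <- w * z[i, "x"]
  sum(z[i, 1:3] * wi) / sum(wi)
}, numeric(1))

all.equal(unname(wmean_vec), wmean_loop)
```

This only works when the weight function can itself be written in vectorized form; for an arbitrary W() a row-wise wrapper is still needed.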

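EDIT

OK guys, thanks for your time and criticism. I tried and tested all of your answers, but I must admit that the actual problem was neither solved nor understood. For example, I did not ask for my weighting function or calculations to be rewritten, since what I provided is already a minimal version of a more complex calculation. The issue here goes much deeper. So I sat down, tried to reduce the problem to the root of the evil, and found a minimal working example for you without any zoos, weightedMeans and so on. Here you go: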

z <- data.frame(matrix (1:20, nrow = 4))
colnames (z) <- c ("a", "b", "c", "x", "y")
z
#   a b  c  x  y
#1 1 5  9 13 17
#2 2 6 10 14 18
#3 3 7 11 15 19
#4 4 8 12 16 20

W <- function(abc, w, p) { 
  ifelse (w[1] == p, return(length(p)), return(0))
  # Please do not complain! I know this is stupid, but it is an MWE
  # and my calculations contained in W() are much more complex!
}

z[,"y"] <- W(z[,1:3], c(14,7,8), z[,"x"])
# same result: z[,"y"] <- apply(z[,1:3], 1, W, c(14,7,8), z[,"x"])
z
#  a b  c  x y
#1 1 5  9 13 4
#2 2 6 10 14 4
#3 3 7 11 15 4
#4 4 8 12 16 4

# expected outcome:
#  a b  c  x y
#1 1 5  9 13 0
#2 2 6 10 14 4
#3 3 7 11 15 0
#4 4 8 12 16 0

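The problem I am facing is that R passes all rows of z[,"x"] to the function, whereas I expect it to take only the row that corresponds to the row of z[,"y"] currently being processed internally as R loops through it. In this example, I expect 14==14 only in row 2! So: how do I tell R to pass the values to the function row by row?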
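One base R tool that walks several columns in parallel, one row at a time, is mapply(). The sketch below is an illustration only: `W_row` is a hypothetical stand-in for the more complex W(), written with a plain if/else so that it returns exactly one value per row.

```r
z <- data.frame(matrix(1:20, nrow = 4))
colnames(z) <- c("a", "b", "c", "x", "y")

# Hypothetical stand-in for a more complex W(): returns the sum of a, b, c
# when the row's own x equals w[1], and 0 otherwise
W_row <- function(abc, w, p) {
  if (w[1] == p) sum(abc) else 0
}

# mapply() steps through a, b, c and x in parallel, one row at a time;
# MoreArgs holds the arguments that stay the same for every row
z$y <- mapply(
  function(a, b, c, x, w) W_row(c(a, b, c), w, x),
  z$a, z$b, z$c, z$x,
  MoreArgs = list(w = c(14, 7, 8))
)
z$y
```

Here only row 2 has x equal to 14, so only that row gets a non-zero value.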

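SOLUTION

In addition to the awarded and accepted answer, I would like to summarize the solution here for clarity and a better overview of the discussion.

This question was not about rewriting the specific function W (e.g. the weighting). It was only about R's inability to pass multiple row-wise arguments to a generic function. With either z$y <- f(z$a, z$x) or z$y <- apply(z$a, 1, f, z$x), both approaches pass only the first argument row by row, while the second argument is passed as the complete column containing all rows. This appears to be inherent behavior of R that we need to work around.

To solve this, the whole row has to be passed as a single argument to a wrapper function, which then applies the specific calculation to that row. Solution for the weighting problem: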

f <- function(x) weighted.mean(x[1:3], W(c(0.1,0.5,0.3), x[4]))
z[,"wmean"] <- apply(z[,1:4], 1, f)

Solution for the general problem with data frames:

f <- function(x) W(x[1:3], c(14,7,8), x[4])
z$y <- apply(z, 1, f)
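One caveat with this pattern: apply() converts a data frame to a matrix before iterating, so a single non-numeric column turns every row into a character vector. The small sketch below illustrates this with a made-up data frame `df`; the column names are assumptions for the example only.

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5), label = c("u", "v", "w"),
                 stringsAsFactors = FALSE)

# With the character column included, every row arrives as character
row_types_all <- apply(df, 1, function(r) class(r))

# Restricting apply() to the numeric columns keeps the rows numeric
row_types_num <- apply(df[, c("a", "b")], 1, function(r) class(r))
row_sums <- apply(df[, c("a", "b")], 1, sum)
```

In the example above all columns of z are numeric, so the wrapper works as written; with mixed column types it is safer to select the numeric columns explicitly.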

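In his accepted answer, Brian also provides a faster approach using compiled C++ code (via Rcpp). Thanks to @BrianAlbertMonroe, @jaimedash and @inscaven for tackling the problem and hinting at this solution.

I haven't really used zoo or rowWeightedMeans, but if you are simply applying weights to the row elements before averaging them, and need the weights to depend on one of the elements of the row: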

z <- matrix(rnorm(100),ncol=4)

W <- function(row, weights){
    weights <- weights * row[4]
    row2 <- row[1:3] * weights
    sum(row2) / sum(weights)

}

w.means <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))

If the above gives the correct answer but you are worried about speed, write the W function in Rcpp or use the built-in cmpfun:

N <- 10000

z <- matrix(rnorm(N),ncol=4)

# Interpreted R function
# (returns sum(row2) / sum(weights), like W above and Wcpp below,
# so that the three results are comparable)
W1 <- function(row, weights){
    weights <- weights * row[4]
    row2 <- row[1:3] * weights
    sum(row2) / sum(weights)
}

# Compiled R function
W2 <- compiler::cmpfun(W1)

# C++ function imported into R via Rcpp
Rcpp::cppFunction('double Wcpp(NumericVector row, NumericVector weights){

                                int x = row.size() ;

                                NumericVector wrow(x - 1);
                                NumericVector nweights(x - 1);

                                nweights = weights * row[x - 1];

                                for( int i = 0; i < (x-1) ; i++){
                                    wrow[i] = row[i] * nweights[i];
                                }

                                double res = sum(wrow) / sum(nweights);

                                return(res);

}')

w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3))
w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3))
w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3))

# identical() compares only two objects, so compare pairwise;
# all.equal() tolerates floating-point rounding differences between R and C++
isTRUE(all.equal(w.means0, w.means1)) && isTRUE(all.equal(w.means0, w.means2))

#[1] TRUE

# Write the whole thing in C++
Rcpp::cppFunction('NumericVector WM(NumericMatrix z , NumericVector weights){
                                int x = z.ncol() ;
                                int y = z.nrow() ;

                                NumericVector res(y);
                                NumericVector wrow(x - 1);

                                NumericVector nweights(x - 1);
                                double nwsum;
                                double mult;

                                for( int row = 0 ; row < y ; row++){

                                    mult = z(row,x-1);

                                    nweights = weights * mult;
                                    nwsum = sum(nweights);

                                    for( int i = 0; i < (x-1) ; i++){

                                        wrow[i] = z(row,i) * nweights[i] ;
                                    }

                                  res[row] = sum(wrow) / nwsum;

                                }

                                return(res);

}')

microbenchmark::microbenchmark(
    w.means0 <- apply(z,1,W1,weights=c(0.1,0.5,0.3)),
    w.means1 <- apply(z,1,W2,weights=c(0.1,0.5,0.3)),
    w.means2 <- apply(z,1,Wcpp,weights=c(0.1,0.5,0.3)),
    w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3))
)

    Unit: microseconds
                                                      expr       min         lq       mean     median         uq       max neval
   w.means0 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 12114.834 12536.9330 12995.1722 12838.2805 13163.4835 15796.403   100
   w.means1 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3))  9941.571 10286.8085 10769.7330 10410.9465 10788.6800 19526.840   100
 w.means2 <- apply(z, 1, Wcpp, weights = c(0.1, 0.5, 0.3)) 10919.112 11631.5530 12849.7294 13262.9705 13707.7465 17438.524   100
         w.means3 <- WM(z = z, weights = c(0.1, 0.5, 0.3))    94.172   107.9855   146.2606   125.0075   140.2695  2089.933   100

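Edit:

Incorporating the weighted.mean function slows the computation down considerably, and according to its help file it does not handle missing values in the weights specially, so you would still need to write code to manage them.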

> z <- matrix(rnorm(100),ncol=4)

> W <- function(row, weights){
+     weights <- weights * row[4]
+     row2 <- row[1:3] * weights
+     sum(row2) / sum(weights)
+ 
+ }

> W1 <- compiler::cmpfun(W)

> W2 <- function(row, weights){
+     weights <- weights * row[4]
+     weighted.mean(row[1:3],weights)
+ }

> W3 <- compiler::cmpfun(W2)

> w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3))

> w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3))

> identical(w.means1,w.means2)
[1] TRUE

> microbenchmark(
+   w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)),
+   w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)),
+   w.means2 < .... [TRUNCATED] 
Unit: microseconds
                                                    expr     min       lq     mean   median       uq     max neval
  w.means1 <- apply(z, 1, W, weights = c(0.1, 0.5, 0.3)) 145.315 167.4550 172.8163 172.9120 180.6920 194.673   100
 w.means1 <- apply(z, 1, W1, weights = c(0.1, 0.5, 0.3)) 124.087 134.3365 143.6803 137.8925 148.7145 225.459   100
 w.means2 <- apply(z, 1, W2, weights = c(0.1, 0.5, 0.3)) 307.311 346.6320 356.4845 354.7325 371.7620 412.110   100
 w.means2 <- apply(z, 1, W3, weights = c(0.1, 0.5, 0.3)) 280.073 308.7110 323.0156 324.1230 333.7305 407.963   100
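The remark about missing values can be illustrated with a short sketch. The helper `safe_wmean` is a hypothetical name introduced here, and dropping every position where either the value or the weight is missing is just one possible policy, not something prescribed by the answer.

```r
# An NA in the weights propagates to the result, even with na.rm = TRUE,
# because na.rm only removes missing values of x
na_result <- weighted.mean(c(1, 2, 3), c(1, NA, 1), na.rm = TRUE)

# Hypothetical helper: drop positions where either x or w is missing
safe_wmean <- function(x, w) {
  ok <- !is.na(x) & !is.na(w)
  sum(x[ok] * w[ok]) / sum(w[ok])
}

safe_result <- safe_wmean(c(1, 2, 3), c(1, NA, 1))
```

Whether dropping, imputing or propagating the missing weight is the right choice depends on the data, so this should be decided per application.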

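I think this can be solved with clever reshaping. I would use dplyr for this, but the workflow should be similar with plyr or data.table; all of these packages are heavily optimized.

For this example I assume the weighting function is w(x) = w0 ^ x

Here I create some sample data z and the generic weights w (note that I add a row number r to z):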

library(dplyr)
library(tidyr)
N <- 10
z <- data.frame(r=1:N, a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))

Now the calculation is:

res <- z %>% gather(key,value,-r,-x) %>% # convert to long format, but keep row numbers and x
  left_join(w, by = 'key') %>%   # add generic weights
  mutate(eff_weight = weight^x) %>% # calculate effective weights
  group_by(r) %>% # group by the original rows for the weighted mean
  summarise(ws = sum(value*eff_weight), ww=sum(eff_weight)) %>% # calculate two helper values
  mutate(weighted_mean = ws/ww) %>% # compute the actual weighted mean
  select(r, weighted_mean) # remove unnecessary output

left_join(z, res, by = 'r') # add to the original data

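I added some comments, but if it is hard to follow, you can evaluate res step by step (removing the tail, including the %>%) and look at the results.

Update

Challenge accepted: finding a way to do the same thing in base R: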
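In newer tidyr versions gather() is superseded by pivot_longer(). The sketch below shows what the same reshaping pattern might look like with it, again assuming the weighting function w(x) = w0 ^ x; the seed and the base R cross-check (`eff`, `check`) are additions for illustration only.

```r
library(dplyr)
library(tidyr)

set.seed(1)
N <- 10
z <- data.frame(r = 1:N, a = rnorm(N), b = rnorm(N), c = rnorm(N), x = rpois(N, 5))
w <- data.frame(key = c("a", "b", "c"), weight = c(0.1, 0.5, 0.3))

res <- z %>%
  pivot_longer(cols = c(a, b, c), names_to = "key", values_to = "value") %>% # long format
  left_join(w, by = "key") %>%            # add generic weights
  mutate(eff_weight = weight^x) %>%       # row-specific effective weights
  group_by(r) %>%                         # one group per original row
  summarise(weighted_mean = sum(value * eff_weight) / sum(eff_weight))

# Cross-check in base R: eff[i, j] is w0[j] ^ x[i]
eff <- outer(z$x, w$weight, function(x, w0) w0^x)
check <- unname(rowSums(as.matrix(z[, c("a", "b", "c")]) * eff) / rowSums(eff))
all.equal(res$weighted_mean, check)
```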

N <- 10
z <- data.frame(a=rnorm(N), b=rnorm(N), c=rnorm(N), x=rpois(N, 5))
w <- data.frame(key=c('a','b','c'), weight=c(0.1,0.5,0.3))

long.z <- reshape(z, idvar = "row", times=c('a','b','c'),
                timevar='key',
                varying = list(c('a','b','c')), direction = "long")
compose.z <- merge(long.z,w, by='key')
compose.z2 <- within(compose.z, eff.weight <- weight^x)

sum.stat <- by(compose.z2, compose.z2$row, function(x) {sum(x$a * x$eff.weight )/sum(x$eff.weight)})

nice.data <- c(sum.stat)

It requires more verbose functions, but the same pattern can be applied.

Here is a solution with zoo::rollapply. For the simpler case it produces the same answer as matrixStats::rowWeightedMeans.

if(! require(matrixStats)) {
        install.packages('matrixStats')
        library(matrixStats)
}
if(! require(zoo)) {
        install.packages('zoo')
        library(zoo)
}
z <- zoo (matrix (1:20, nrow = 5))
colnames (z) <- c ("a", "b", "c", "x")
z$x <- 0 # so we can see an effect below...
z
##   a  b  c x
## 1 1  6 11 0
## 2 2  7 12 0
## 3 3  8 13 0
## 4 4  9 14 0
## 5 5 10 15 0

weights <- c(0.1,0.5,0.3)
W <- function (w) { return(w); }
z$wmean <- rowWeightedMeans(z[,1:3], w=W(weights))
## z[,new]<- doesn't work to create new columns in zoo
## objects
## use $

rowWeightMean_zoo <- function (r, W, weights) {
        s <- sum(W(weights))
        return(sum(r[1:3] * W(weights) / s))
}

z$wmean_zoo <- rollapply(z, width=1, by.column=FALSE,
                         function (r) rowWeightMean_zoo(r, W, weights))
z

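For the requirement in the question, where the return value depends on some auxiliary data in the row, rowWeightedMeans does not work. However, the function passed to rollapply can be modified to use the other elements of the row.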

W2 <- function (w, x) { return(w * x); }
# z$wmean2 <- rowWeightedMeans(z[,1:3], w=W2(c(0.1,0.5,0.3), z[,4]))
## doesn't work
## Error in rowWeightedMeans(z[, 1:3], w = W2(c(0.1, 0.5, 0.3), z[, 4])) :
##   The length of argument 'w' is does not match the number of column in 'x': 5 != 3
## In addition: Warning message:
## In `*.default`(w, x) :
##   longer object length is not a multiple of shorter object length
## Calls: rowWeightedMeans -> W -> Ops.zoo -> NextMethod

rowWeightMean_zoo_dependent <- function (r, W, weights) {
        w <- W(weights, r[4]) # row-specific weights from the function passed in
        return(sum(r[1:3] * w / sum(w)))
}
z$wmean2_zoo <- rollapply(z, width=1, by.column=FALSE,
                         function (r) rowWeightMean_zoo_dependent(r, W2, weights))
z
##   a  b  c x     wmean wmean_zoo wmean2_zoo
## 1 1  6 11 0  7.111111  7.111111        NaN
## 2 2  7 12 0  8.111111  8.111111        NaN
## 3 3  8 13 0  9.111111  9.111111        NaN
## 4 4  9 14 0 10.111111 10.111111        NaN
## 5 5 10 15 0 11.111111 11.111111        NaN
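The NaN values above come from x being 0 in every row, which makes all the row-specific weights zero. The following sketch repeats the idea with non-zero x; the names `row_wmean_dependent`, `wmean2` and `plain` are made up for this illustration. It also shows that, for the particular choice W2(w, x) = w * x, the factor x cancels in the weighted mean, so the result matches the plain weighted mean.

```r
library(zoo)

z <- zoo(matrix(1:20, nrow = 5))
colnames(z) <- c("a", "b", "c", "x")   # x is 16..20 here, so no division by zero

weights <- c(0.1, 0.5, 0.3)
W2 <- function(w, x) w * x

# Row-wise weighted mean with weights that depend on the 4th element of the row
row_wmean_dependent <- function(r, W, weights) {
  w <- W(weights, r[4])
  sum(r[1:3] * w) / sum(w)
}

wmean2 <- rollapply(z, width = 1, by.column = FALSE,
                    FUN = function(r) row_wmean_dependent(r, W2, weights))

# Cross-check: scaling all weights of a row by the same non-zero x
# does not change the weighted mean
plain <- as.numeric(coredata(z)[, 1:3] %*% weights) / sum(weights)
all.equal(as.numeric(coredata(wmean2)), plain)
```

For a weight function in which x does not cancel, the same rollapply call applies unchanged; only W2 needs to be swapped out.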