Speeding up ddply

Question

I have a data.frame like this:

n  = 50
df = data.frame(group=sample(1:as.integer(n/2),n,replace=TRUE),
                x = runif(n),
                y = runif(n),
                z = runif(n))
df = df[with(df,order(group)),]

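For each unique value of group, I need to generate segments, that is, create new columns xend, yend and zend that hold the x, y and z values of the previous point in that group. For the last value in the group, the ends are taken as the first point in the group.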

I can do this as follows:

library(plyr)

res = ddply(df,"group",function(d){ 
  ixc  = c("x","y","z")
  dfE  = d[,ixc]
  # Move the last row to the front; seq_len() also handles single-row groups
  dfE  = dfE[c(nrow(dfE),seq_len(nrow(dfE)-1)),]
  colnames(dfE) = paste0(ixc,"end")
  cbind(d,dfE)
})
print(head(res))

When n is small this is trivial, but as n grows the time taken by the code above becomes significant. Is there a faster way to do this, perhaps using data.table?

Answer 1

You can do this with the shift function from the data.table package. An example for xend:

library(data.table) 
setDT(df)[, xend := shift(x, 1L, fill = x[.N], type = "lag"), by = group]
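To see what shift does within a single group, here is a minimal sketch on a made-up five-element vector (the values are purely illustrative and not from the question's data). With type = "lag" every element moves down one position, and fill supplies the value for the vacated first slot; passing the last element makes the lag wrap around. Outside of a data.table call there is no .N, so length(x) is used instead.

```r
library(data.table)

# Hypothetical toy vector standing in for one group's x values
x <- c(10, 20, 30, 40, 50)

# Lag by one; the empty first slot is filled with the last element
xend <- shift(x, 1L, fill = x[length(x)], type = "lag")
print(xend)  # [1] 50 10 20 30 40

# A single-element group simply maps onto itself
single <- shift(7, 1L, fill = 7, type = "lag")
print(single)  # [1] 7
```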

For all columns:

setDT(df)[, c("xend","yend","zend") := .(shift(x, 1L, fill = x[.N], type = "lag"),
                                         shift(y, 1L, fill = y[.N], type = "lag"),
                                         shift(z, 1L, fill = z[.N], type = "lag")),
          by = group]

This gives you:

> head(df)
   group          x         y          z       xend      yend       zend
1:     1 0.56725304 0.7539735 0.20542455 0.71538606 0.3864990 0.01586889
2:     1 0.64251519 0.1255183 0.93371528 0.56725304 0.7539735 0.20542455
3:     1 0.14182485 0.7351444 0.89199415 0.64251519 0.1255183 0.93371528
4:     1 0.06613097 0.7625182 0.92669617 0.14182485 0.7351444 0.89199415
5:     1 0.71538606 0.3864990 0.01586889 0.06613097 0.7625182 0.92669617
6:     4 0.27188921 0.5496977 0.09282217 0.27188921 0.5496977 0.09282217

Another approach, suggested by @akrun in the comments:

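# .SDcols restricts .SD to x, y and z, so this also works if the *end columns already exist
setDT(df)[, c("xend","yend","zend") := lapply(.SD, function(x) shift(x, fill = x[.N]))
          , by = group, .SDcols = c("x","y","z")]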

While this approach requires less typing and gives more flexibility in which variables to include, it is also considerably slower.
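One way to check the speed difference on your own machine is to time both variants with system.time. The following is only a sketch: the object names (dt, dt_explicit, dt_sd) are made up for illustration, n is set lower than in the answer's data so it runs quickly, and no particular timings are implied.

```r
library(data.table)

set.seed(1)
n <- 1e4  # assumption: smaller than the answer's 1e5, raise it for a realistic comparison
dt <- data.table(group = sample(1:as.integer(n / 2), n, replace = TRUE),
                 x = runif(n), y = runif(n), z = runif(n))
setorder(dt, group)

dt_explicit <- copy(dt)
dt_sd       <- copy(dt)

# Explicit version: one shift() call per column
t_explicit <- system.time(
  dt_explicit[, c("xend", "yend", "zend") :=
                .(shift(x, 1L, fill = x[.N], type = "lag"),
                  shift(y, 1L, fill = y[.N], type = "lag"),
                  shift(z, 1L, fill = z[.N], type = "lag")),
              by = group]
)

# .SD version: the same shift applied to each column named in .SDcols
t_sd <- system.time(
  dt_sd[, c("xend", "yend", "zend") :=
          lapply(.SD, function(v) shift(v, fill = v[.N])),
        by = group, .SDcols = c("x", "y", "z")]
)

# Elapsed seconds for each variant (values depend on the machine)
print(c(explicit = t_explicit[["elapsed"]], sd = t_sd[["elapsed"]]))

# Both variants should produce the same columns
stopifnot(isTRUE(all.equal(dt_explicit, dt_sd)))
```

A single system.time call is noisy; repeating the measurement several times gives a more reliable picture.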

In the question, you say:

For the last value in the group, the ends are taken as the first point in the group.

However, given the desired behavior you describe, the last value in the group uses the previous value in the group. I assumed you meant:

For the first value in the group, the ends are taken as the last point in the group.

Data used:

set.seed(1)
n  = 1e5
df = data.frame(group=sample(1:as.integer(n/2),n,replace=TRUE),
                x = runif(n),
                y = runif(n),
                z = runif(n))
df = df[with(df,order(group)),]
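As a sanity check, the data.table result can be compared against an independent base R computation. The sketch below assumes a smaller n than above and uses a hypothetical helper, wrap_lag, that is not part of the original answer; it only checks the x column.

```r
library(data.table)

set.seed(1)
n  <- 1000  # assumption: reduced size, just for a quick check
df <- data.frame(group = sample(1:as.integer(n / 2), n, replace = TRUE),
                 x = runif(n))
df <- df[order(df$group), ]

# Hypothetical helper: move the last value of a vector to the front
wrap_lag <- function(v) v[c(length(v), seq_len(length(v) - 1))]

# Base R reference: apply the wrap-around lag within each group
ref <- ave(df$x, df$group, FUN = wrap_lag)

# data.table version from the answer, applied to a copy of the data
dt <- as.data.table(df)
dt[, xend := shift(x, 1L, fill = x[.N], type = "lag"), by = group]

stopifnot(isTRUE(all.equal(ref, dt$xend)))
```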
