Splitting a data.table, then modifying by reference

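I have a use case where I need to split a data.table and then apply a different modify-by-reference operation to each partition. However, splitting forces a copy of each table.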

Here is a toy example on the iris dataset:

library(data.table)

# split the data
DT <- data.table(iris)
out <- split(DT, DT$Species)

# assign partitions to the global environment
NAMES <- as.character(unique(DT$Species))
lapply(seq_along(out), function(x) {
  assign(NAMES[x], out[[x]], envir = .GlobalEnv)
})

# modify by reference: same function applied to different columns for different partitions
# (would do this programmatically in the real use case)
virginica[, summ := sum(Petal.Length)]
setosa[, summ := sum(Petal.Width)]

#rbind all (again, programmatic)
do.call(rbind, list(virginica, setosa))

I then get the following warning:

 Warning message:
 In `[.data.table`(out$virginica, , `:=`(cumPedal, cumsum(Petal.Width))) :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference.

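I understand this has to do with putting data.tables in a list. Is there a workaround for this use case, or a way to avoid using split? Note that in the real case I want to modify by reference programmatically, so hardcoding a solution won't work.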

Here is an example that uses .EACHI to do what you are after:

## Create a data.table that indicates the pairs of keys to columns
New <- data.table(
  Species = c("virginica", "setosa", "versicolor"), 
  FunCol = c("Petal.Length", "Petal.Width", "Sepal.Length"))

## Set the key of your original data.table
setkey(DT, Species)

## Now use .EACHI
DT[New, temp := cumsum(get(FunCol)), by = .EACHI][]
#      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  temp
#   1:          5.1         3.5          1.4         0.2    setosa   0.2
#   2:          4.9         3.0          1.4         0.2    setosa   0.4
#   3:          4.7         3.2          1.3         0.2    setosa   0.6
#   4:          4.6         3.1          1.5         0.2    setosa   0.8
#   5:          5.0         3.6          1.4         0.2    setosa   1.0
#  ---                                                                  
# 146:          6.7         3.0          5.2         2.3 virginica 256.9
# 147:          6.3         2.5          5.0         1.9 virginica 261.9
# 148:          6.5         3.0          5.2         2.0 virginica 267.1
# 149:          6.2         3.4          5.4         2.3 virginica 272.5
# 150:          5.9         3.0          5.1         1.8 virginica 277.6

## Basic verification
head(cumsum(DT["setosa", ]$Petal.Width), 5)
# [1] 0.2 0.4 0.6 0.8 1.0
tail(cumsum(DT["virginica", ]$Petal.Length), 5)
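
If the join with .EACHI feels opaque, the same per-group update can probably be written as a plain loop over the mapping table. This is only a sketch, not the approach from the answer above: it assumes the same Species-to-column pairs, rebuilds DT from iris so that it runs on its own, and the loop variables sp and col are names I made up for illustration. Each `:=` call still modifies DT in place, so there is no split() and no per-partition copy.

```r
library(data.table)

DT <- data.table(iris)

# Mapping of each Species to the column that should be accumulated
# (same pairs as the `New` table in the answer above).
New <- data.table(
  Species = c("virginica", "setosa", "versicolor"),
  FunCol  = c("Petal.Length", "Petal.Width", "Sepal.Length"))

# Update one group at a time. `sp` and `col` are hypothetical helper
# variables; they are not column names in DT, so data.table looks them
# up in the calling scope.
for (i in seq_len(nrow(New))) {
  sp  <- New$Species[i]
  col <- New$FunCol[i]
  # := adds/updates `temp` by reference, only for the rows of this group
  DT[Species == sp, temp := cumsum(get(col))]
}

# Spot check against a direct computation for one group
all.equal(DT[Species == "setosa", temp],
          cumsum(iris$Petal.Width[iris$Species == "setosa"]))
```

The loop does one subset per group, so with many groups the keyed join with by = .EACHI is likely the better fit; the loop is mainly easier to read and to extend when each group needs a genuinely different function rather than just a different column.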