How to chain together a mix of data.table and base R functions?

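I am using the data.table package to work with very large data sets, and I value its speed and clarity. But I am new to it and have a hard time chaining functions together, especially when using a mix of data.table and base R functions. My question is: how do I chain together the example functions below into one seamless string of code that defines the target data object?

Below is the correct output, generated by running each line of code separately (not chained). The generating code is shown immediately beneath the output: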

> data
    ID Period State Values
 1:  1      1    X0      5
 2:  1      2    X1      0
 3:  1      3    X2      0
 4:  1      4    X1      0
 5:  2      1    X0      1
 6:  2      2    HI      0
 7:  2      3    HI      0
 8:  2      4    HI      0
 9:  3      1    X2      0
10:  3      2    X1      0
11:  3      3    X9      0
12:  3      4    X3      0
13:  4      1    X2      1
14:  4      2    X1      2
15:  4      3    X9      3
16:  4      4    HI      0

library(data.table)

data <- 
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  )

# changes State to "XX" if remaining Values_1 + Values_2 cumulative sums = 0 for each ID: 
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID]

# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1] 

# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?  

# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?  

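For what it's worth, below is my attempt to chain everything together with the %>% pipe operator. It failed with the error message "Error in data$State : object of type 'closure' is not subsettable". In any case, I would rather chain with data.table operators: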

data <- 
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  ) %>%
  setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
  setDT(data)[,Values := Values_1] %>%
  subset(data, select = -c(Values_1,Values_2)) %>%
  data$State <- gsub('XX','HI', data$State)

You can chain data.tables with the magrittr pipe by putting . before [. Try the code below:

library(dplyr)
library(magrittr)
library(data.table)
data <- 
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  ) %>% 
  # the pipe already supplies the data frame as setDT()'s first argument;
  # passing data as well would land in keep.rownames and add an rn column
  setDT() %>%
  .[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
  .[,Values := Values_1] %>%
  select(-c(Values_1, Values_2)) %>%
  mutate(State = gsub('XX','HI', State))

Output:

    ID Period State Values
 1:  1      1    X0      5
 2:  1      2    X1      0
 3:  1      3    X2      0
 4:  1      4    X1      0
 5:  2      1    X0      1
 6:  2      2    HI      0
 7:  2      3    HI      0
 8:  2      4    HI      0
 9:  3      1    X2      0
10:  3      2    X1      0
11:  3      3    X9      0
12:  3      4    X3      0
13:  4      1    X2      1
14:  4      2    X1      2
15:  4      3    X9      3
16:  4      4    HI      0
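
As a side note, the same pipe style can stay entirely within data.table, without the dplyr verbs. The following is only a minimal sketch on a made-up three-row table (dt and result are illustrative names, not part of the question's data); it relies on %>% exposing the left-hand side as . so that .[...] is an ordinary data.table subset call:

```r
library(data.table)
library(magrittr)

# Small illustrative table; the values are made up for this sketch.
dt <- data.table(
  ID       = c(1, 1, 2),
  State    = c("X0", "XX", "XX"),
  Values_1 = c(5, 0, 1),
  Values_2 = c(5, 0, 0)
)

result <- dt %>%
  .[, Values := Values_1] %>%                      # add the new column by reference
  .[, c("Values_1", "Values_2") := NULL] %>%       # drop both old columns by reference
  .[State == "XX", State := "HI"]                  # recode only the matching rows

names(result)
result$State
```

Because := works by reference, dt itself is modified as well; wrap the input in copy() if the original table must stay untouched.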

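You can chain using just the bracket notation [. That way you only need to call setDT() once: all the operations stay within the data.table universe, so data never stops being a data.table. Also, setDT() modifies in place, so its result does not need to be assigned (although assigning its return value to data at the end of a pipe, as done here, is fine too).

First define the data and make it a data.table: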


library(data.table)
data <-
    data.frame(
        ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
        Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
        Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
        Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
        State = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0", "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
    ) |>
    setDT()

Then define the columns you need. Note the functional form of :=, which assigns several columns in one call:

data[, `:=`(
    State = ifelse(
        rev(cumsum(rev(Values_1 + Values_2))),
        State, "XX"
    )
),
by = ID
][
    ,
    `:=`(
        Values = Values_1,
        Values_1 = NULL,
        Values_2 = NULL,
        State = gsub("XX", "HI", State)
    )
]

Output:

data
#     ID Period State Values
#  1:  1      1    X0      5
#  2:  1      2    X1      0
#  3:  1      3    X2      0
#  4:  1      4    X1      0
#  5:  2      1    X0      1
#  6:  2      2    HI      0
#  7:  2      3    HI      0
#  8:  2      4    HI      0
#  9:  3      1    X2      0
# 10:  3      2    X1      0
# 11:  3      3    X9      0
# 12:  3      4    X3      0
# 13:  4      1    X2      1
# 14:  4      2    X1      2
# 15:  4      3    X9      3
# 16:  4      4    HI      0
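
The two claims above (setDT() converts in place, and bracket calls can follow one another) can be checked with a small sketch. The data frame df and its values are made up for illustration:

```r
library(data.table)

# Illustrative data frame, not the question's data.
df <- data.frame(ID = c(1, 1, 2), Values_1 = c(5, 0, 1))

setDT(df)            # no assignment needed: df itself becomes a data.table
is.data.table(df)

# := returns the data.table, so another [ ] can follow immediately.
df[, Values := Values_1][, Values_1 := NULL]
names(df)
```

Each := step updates df by reference, which is why nothing has to be reassigned between the brackets.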

You may want to read further about chaining commands in data.table. I think that page gives a good summary of the package's syntax and features and is worth a read.

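If I understand correctly, the OP wants to

  • rename column Values_1 to Values (or, in the OP's words: create new column "Values", which equals "Values_1")
  • drop column Values_2
  • replace all occurrences of XX in the State column with HI

Here is what I would do in data.table syntax: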

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
  , Values_2 := NULL][
    State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
       ID Period Values  State
 1:     1      1      5     X0
 2:     1      2      0     X1
 3:     1      3      0     X2
 4:     1      4      0     X1
 5:     2      1      1     X0
 6:     2      2      0     HI
 7:     2      3      0     HI
 8:     2      4      0     HI
 9:     3      1      0     X2
10:     3      2      0     X1
11:     3      3      0     X9
12:     3      4      0     X3
13:     4      1      1     X2
14:     4      2      2     X1
15:     4      3      3     X9
16:     4      4      0     HI

setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and then delete Values_1 later.
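
To illustrate the difference, here is a minimal sketch on a made-up two-row table (renamed and copied are illustrative names). It also shows why the column order in the output above differs from the other answers: renaming keeps the column in place, while copy-then-delete appends the new column at the end.

```r
library(data.table)

# Illustrative table, not the question's data.
renamed <- data.table(ID = 1:2, Values_1 = c(5, 0), State = c("X0", "X1"))
copied  <- copy(renamed)

# Rename in place: no assignment, column position is kept.
setnames(renamed, "Values_1", "Values")

# Copy the column, then drop the original: the new column ends up last.
copied[, Values := Values_1][, Values_1 := NULL]

names(renamed)  # "ID" "Values" "State"
names(copied)   # "ID" "State" "Values"
```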

In addition, [State == "XX", State := "HI"] replaces XX with HI only in the affected rows, by reference, whereas
[, State := gsub('XX','HI', State)] replaces the whole column.
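
The two forms also differ in how they match, which the following sketch tries to make visible. The value "XXL" is a hypothetical one added purely for illustration and does not occur in the question's data:

```r
library(data.table)

# "XXL" is a made-up value used only to show the matching difference.
a <- data.table(State = c("X0", "XX", "XXL"))
b <- copy(a)

# Exact match on the whole value; only row 2 is touched.
a[State == "XX", State := "HI"]

# Pattern match; every element is passed through gsub() and the column is rewritten.
b[, State := gsub("XX", "HI", State)]

a$State  # "X0" "HI" "XXL"
b$State  # "X0" "HI" "HIL"
```

For the question's data both forms give the same result, since XX only ever appears as a complete value.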

data.table chaining is used where appropriate.

By the way: I wonder why XX cannot be replaced by HI right away in the first statement:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
  , Values_2 := NULL][]
setnames(data, "Values_1", "Values")
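
For readers unfamiliar with the rev(cumsum(rev(...))) idiom in the first statement, the sketch below walks through it on a single vector. The vector is the Values_1 + Values_2 sum for ID 4 from the question; the names values and remaining, and the label "keep", are illustrative only:

```r
# Values_1 + Values_2 for ID 4 in the question's data.
values <- c(5, 7, 9, 0)

# Reverse, accumulate, reverse back: each position holds the sum
# of the current and all later values.
remaining <- rev(cumsum(rev(values)))
remaining  # 21 16 9 0

# ifelse() coerces the numeric test to logical, so 0 is FALSE and
# any non-zero value is TRUE.
ifelse(remaining, "keep", "HI")  # "keep" "keep" "keep" "HI"
```

Once the remaining sum reaches zero, the row and everything after it within that ID is relabelled, which is what the by-ID statement above does with State in place of "keep".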