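How to chain together a mix of data.table and base R functions?
I am using the data.table package to work with very large data sets, and I value its speed and clarity. But I am new to it and struggle to chain functions together, especially when using a mix of data.table and base R functions. My question is: how do I chain the example functions below into one seamless string of code that defines the target data object?
Below is the correct output, generated by running each line of code separately (unchained). The generating code is shown immediately beneath the output: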
> data
ID Period State Values
1: 1 1 X0 5
2: 1 2 X1 0
3: 1 3 X2 0
4: 1 4 X1 0
5: 2 1 X0 1
6: 2 2 XX 0
7: 2 3 XX 0
8: 2 4 XX 0
9: 3 1 X2 0
10: 3 2 X1 0
11: 3 3 X9 0
12: 3 4 X3 0
13: 4 1 X2 1
14: 4 2 X1 2
15: 4 3 X9 3
16: 4 4 XX 0
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
)
# changes State to "XX" if remaining Values_1 + Values_2 cumulative sums = 0 for each ID:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID]
# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1]
# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?
# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?
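For what it's worth, below is my attempt to chain everything together with the "%>%" pipe operator. It failed (error message: Error in data$State : object of type 'closure' is not subsettable), and I would rather chain with data.table operators anyway: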
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
setDT(data)[,Values := Values_1] %>%
subset(data, select = -c(Values_1,Values_2)) %>%
data$State <- gsub('XX','HI', data$State)
You can use the magrittr package and put . before [ to chain data.tables. Try the code below:
library(dplyr)
library(magrittr)
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
# the piped data.frame is already the first argument of setDT();
# writing setDT(data) here would pass `data` as keep.rownames and add an "rn" column
setDT() %>%
.[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
.[,Values := Values_1] %>%
select(-c(Values_1, Values_2)) %>%
mutate(State = gsub('XX','HI', State))
Output:
ID Period State Values
1: 1 1 X0 5
2: 1 2 X1 0
3: 1 3 X2 0
4: 1 4 X1 0
5: 2 1 X0 1
6: 2 2 HI 0
7: 2 3 HI 0
8: 2 4 HI 0
9: 3 1 X2 0
10: 3 2 X1 0
11: 3 3 X9 0
12: 3 4 X3 0
13: 4 1 X2 1
14: 4 2 X1 2
15: 4 3 X9 3
16: 4 4 HI 0
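You can chain using nothing but the bracket notation [. That way you only need to call setDT() once: all operations stay inside the data.table universe, so data never stops being a data.table. Also, setDT() modifies in place, so it does not need an assignment (although piping into it and assigning its return value to data is fine too).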
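To illustrate the two points above, here is a minimal sketch on a made-up toy table (the name `df` and the columns `Total` and `Flag` are assumptions for illustration, not part of the question). It shows that setDT() converts the object without assignment and that each `[...]` call returns the data.table, so another `[...]` can follow directly:

```r
library(data.table)

# Toy data.frame, illustrative only
df <- data.frame(ID = c(1, 1, 2), Values_1 = c(5, 0, 1))

setDT(df)           # no `df <- ...` needed: df is converted in place
is.data.table(df)   # TRUE

# `:=` updates by reference and returns the data.table,
# so the next `[...]` operates on the updated table
df[, Total := sum(Values_1), by = ID][Total > 1, Flag := "big"]
df
```

Rows that do not match the filter in the second bracket keep `NA` in the new `Flag` column.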
First define the data and make it a data.table:
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0", "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
) |>
setDT()
Then define the columns you need. Note the functional form of `:=`, which lets you assign (or delete) several columns in a single call:
data[, `:=`(
State = ifelse(
rev(cumsum(rev(Values_1 + Values_2))),
State, "XX"
)
),
by = ID
][
,
`:=`(
Values = Values_1,
Values_1 = NULL,
Values_2 = NULL,
State = gsub("XX", "HI", State)
)
]
Output:
data
# ID Period State Values
# 1: 1 1 X0 5
# 2: 1 2 X1 0
# 3: 1 3 X2 0
# 4: 1 4 X1 0
# 5: 2 1 X0 1
# 6: 2 2 HI 0
# 7: 2 3 HI 0
# 8: 2 4 HI 0
# 9: 3 1 X2 0
# 10: 3 2 X1 0
# 11: 3 3 X9 0
# 12: 3 4 X3 0
# 13: 4 1 X2 1
# 14: 4 2 X1 2
# 15: 4 3 X9 3
# 16: 4 4 HI 0
You may want to read further about chaining commands in data.table. I think that page gives a good summary of the package's syntax and features and is worth a read.
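If I understand correctly, the OP wants to
- rename column Values_1 to Values (or, in the OP's words: create new column "Values", which equals "Values_1")
- delete column Values_2
- replace all occurrences of XX in column State with HI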
Here is what I would do in data.table syntax:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
ID Period Values State
1: 1 1 5 X0
2: 1 2 0 X1
3: 1 3 0 X2
4: 1 4 0 X1
5: 2 1 1 X0
6: 2 2 0 HI
7: 2 3 0 HI
8: 2 4 0 HI
9: 3 1 0 X2
10: 3 2 0 X1
11: 3 3 0 X9
12: 3 4 0 X3
13: 4 1 1 X2
14: 4 2 2 X1
15: 4 3 3 X9
16: 4 4 0 HI
setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and delete Values_1 later.
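As a rough check of the "no copy" claim, the sketch below uses data.table's address() to compare where the column lives before and after the rename. The toy table `dt` is an assumption for illustration only:

```r
library(data.table)

# Illustrative toy table
dt <- data.table(ID = 1:3, Values_1 = c(5, 0, 1))

addr_before <- address(dt$Values_1)   # memory address of the column vector
setnames(dt, "Values_1", "Values")    # rename in place, no assignment needed
addr_after <- address(dt$Values)

names(dt)
# compare the two addresses to see whether the column was copied
identical(addr_before, addr_after)
```

By contrast, `Values := Values_1` followed by `Values_1 := NULL` first allocates a second column and then drops the original.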
Also, [State == "XX", State := "HI"] replaces XX with HI by reference and only in the affected rows, whereas [, State := gsub('XX','HI', State)] replaces the whole column.
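The two forms also differ in what they match, which a small sketch can make visible. The tables `a` and `b` and the value "XXL" are made up for illustration. In the OP's data XX only ever appears as a complete value, so there both forms give the same result:

```r
library(data.table)

# Illustrative toy tables; "XXL" is a made-up value
a <- data.table(State = c("X0", "XX", "XXL"))
b <- copy(a)

# exact comparison: only rows whose State equals "XX" are updated
a[State == "XX", State := "HI"]

# pattern replacement: every "XX" substring is rewritten, in every row
b[, State := gsub("XX", "HI", State)]

a$State   # "X0" "HI" "XXL"
b$State   # "X0" "HI" "HIL"
```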
data.table chaining is used where appropriate.
By the way: I wonder why XX cannot be replaced with HI right away in the first statement:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")
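One way to check that the shortcut is safe for this data is to run both variants on fresh copies of the input and compare them. This is only a sketch: `make_data()` is a hypothetical helper that rebuilds the OP's input, and `two_step`, `one_step` and `same` are illustrative names.

```r
library(data.table)

# Hypothetical helper: rebuilds the OP's input as a fresh data.table
make_data <- function() {
  data.table(
    ID       = rep(c(1, 2, 3, 4), each = 4),
    Period   = rep(c(1, 2, 3, 4), times = 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State    = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0",
                 "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
  )
}

# Two steps: mark trailing all-zero periods with "XX", then recode to "HI"
two_step <- make_data()[
  , State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), by = ID][
  State == "XX", State := "HI"]

# One step: write "HI" directly
one_step <- make_data()[
  , State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), by = ID]

same <- all.equal(two_step, one_step)
same
```

The two variants would differ only if the input already contained XX values that should not be recoded, which is not the case in the sample data.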