Create sequence with data.table
I have a data.table of the form
id | pet | name
2011-01-01 | "dog" | "a"
2011-01-02 | "dog" | "b"
2011-01-03 | "cat" | "c"
2011-01-04 | "dog" | "a"
2011-01-05 | "dog" | "some"
2011-01-06 | "cat" | "thing"
I want to perform an aggregation that concatenates all the dog names that appear before each cat, for example:
id | pet | name | prior
2011-01-01 | "dog" | "a" |
2011-01-02 | "dog" | "b" |
2011-01-03 | "cat" | "c" | "a b"
2011-01-04 | "dog" | "a" |
2011-01-05 | "dog" | "some" |
2011-01-06 | "cat" | "thing" | "a some"
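Try
library(data.table)  # v1.9.5+
# name[-.N] drops the last row of each group (the cat itself);
# unlike name[1:(.N-1)], it also behaves correctly when a group has only one row
setDT(df1)[, prior := paste(name[-.N], collapse = ' '),
  .(group = cumsum(c(0, diff(pet == 'cat')) < 0))][pet != 'cat', prior := '']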
#            id pet  name  prior
# 1: 2011-01-01 dog     a
# 2: 2011-01-02 dog     b
# 3: 2011-01-03 cat     c    a b
# 4: 2011-01-04 dog     a
# 5: 2011-01-05 dog  some
# 6: 2011-01-06 cat thing a some
Or a possible solution with shift (introduced in the devel version, i.e. v1.9.5), inspired by @David Arenburg's post. Instructions for installing the devel version are here.
setDT(df1)[, prior := paste(name[-.N], collapse= ' '),
.(group=cumsum(shift(pet, fill='cat')=='cat'))][pet!='cat', prior := '']
Data
df1 <- structure(list(id = c("2011-01-01 ", "2011-01-02 ", "2011-01-03 ",
"2011-01-04 ", "2011-01-05 ", "2011-01-06 "), pet = c("dog",
"dog", "cat", "dog", "dog", "cat"), name = c("a", "b", "c", "a",
"some", "thing")), .Names = c("id", "pet", "name"), row.names = c(NA,
-6L), class = "data.frame")
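To see why both grouping expressions work, it can help to print the group ids on their own. The following is only a sketch on the sample pet column; the variable names (is_cat, grp_diff, grp_shift, pet2) are made up for illustration.

```r
library(data.table)

pet <- c("dog", "dog", "cat", "dog", "dog", "cat")

# diff-based ids: diff() is -1 on the row right after a cat,
# so the running count increases exactly there
is_cat   <- pet == "cat"
grp_diff <- cumsum(c(0, diff(is_cat)) < 0)

# shift-based ids: compare each row with the previous pet;
# fill = "cat" makes the very first row open a group
grp_shift <- cumsum(shift(pet, fill = "cat") == "cat")

print(grp_diff)   # 0 0 0 1 1 1
print(grp_shift)  # 1 1 1 2 2 2

# The two are not interchangeable when cats are adjacent:
pet2 <- c("dog", "cat", "cat")
print(cumsum(c(0, diff(pet2 == "cat")) < 0))            # 0 0 0
print(cumsum(shift(pet2, fill = "cat") == "cat"))       # 1 1 2
```

On the sample data the ids differ only by a constant, so the groups are the same. With two cats in a row, the diff version keeps both cats in one group, while the shift version gives the second cat its own (empty) group.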
Another option
indx <- setDT(DT)[, list(.I[.N], paste(name[-.N], collapse = ' ')),
by = list(c(0L, cumsum(pet == "cat")[-nrow(DT)]))]
DT[indx$V1, prior := indx$V2]
DT
#            id pet  name  prior
# 1: 2011-01-01 dog     a     NA
# 2: 2011-01-02 dog     b     NA
# 3: 2011-01-03 cat     c    a b
# 4: 2011-01-04 dog     a     NA
# 5: 2011-01-05 dog  some     NA
# 6: 2011-01-06 cat thing a some
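A step-by-step sketch of what this option computes, rebuilt on the sample data so it runs on its own. The intermediate variable grp is introduced here only for illustration; the answer above builds it inline inside by.

```r
library(data.table)

DT <- data.table(
  id   = sprintf("2011-01-%02d", 1:6),
  pet  = c("dog", "dog", "cat", "dog", "dog", "cat"),
  name = c("a", "b", "c", "a", "some", "thing")
)

# group id: number of cats seen strictly before the current row
grp <- c(0L, cumsum(DT$pet == "cat")[-nrow(DT)])
print(grp)  # 0 0 0 1 1 1

# one row per group:
#   V1 = row number of the group's last row (.I[.N])
#   V2 = names of all earlier rows in that group
indx <- DT[, list(.I[.N], paste(name[-.N], collapse = " ")), by = list(grp)]
print(indx)

# update only the rows listed in V1; every other row keeps NA
DT[indx$V1, prior := indx$V2]
print(DT)
```

Because only the last row of each group is assigned, the remaining rows hold NA rather than the empty string used by the first two solutions.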
I ran each solution on my dataset and compared the run times with rbenchmark.
I can't share the dataset, but here is some basic information:
dim(event_source_causal_parts)
[1] 311127 4
The comparison code:
require(rbenchmark)
benchmark({
event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)]
setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(c(0,diff(Source == "Warranty")) < 0))][Source != 'Warranty', prior := '']
})
benchmark({
event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)]
setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(shift(Source, fill="Warranty") == "Warranty"))][Source != 'Warranty', prior := '']
})
benchmark({
event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)]
indx <- setDT(event_source_causal_parts)[, list(.I[.N], paste(Causal_Part_Number[-.N], collapse = " ")),
by = list(c(0L, cumsum(Source == "Warranty")[-nrow(event_source_causal_parts)]))]
})
The results are as follows:
  replications elapsed relative user.self sys.self user.child sys.child
1          100   12.91        1     12.76     0.05         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100    12.7        1     12.66     0.05         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100   61.97        1     61.65        0         NA        NA
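Each benchmark() call above times a single expression, so the relative column is always 1. Passing several named expressions to one call lets rbenchmark fill in relative for you. The sketch below does this on synthetic data, since the real dataset is private: the row count, the "Repair" value, and the helper names by_diff and by_shift are all assumptions made up for the example, and the timings it prints will not match the ones reported here.

```r
library(data.table)
library(rbenchmark)

set.seed(1)  # synthetic stand-in for the private dataset
n  <- 10000L
dt <- data.table(
  Source = sample(c("Warranty", "Repair"), n, replace = TRUE, prob = c(0.2, 0.8)),
  Causal_Part_Number = sample(letters, n, replace = TRUE)
)

# hypothetical wrappers around the two fastest solutions
by_diff <- function(d) {
  d[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
    by = .(group = cumsum(c(0, diff(Source == "Warranty")) < 0))]
  d[Source != "Warranty", prior := ""][]
}
by_shift <- function(d) {
  d[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
    by = .(group = cumsum(shift(Source, fill = "Warranty") == "Warranty"))]
  d[Source != "Warranty", prior := ""][]
}

# copy() keeps := from modifying dt between replications
res <- benchmark(
  diff  = by_diff(copy(dt)),
  shift = by_shift(copy(dt)),
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)
print(res)
```

The copy() calls add the same overhead to both expressions, so the comparison stays fair even though the absolute times include the copy.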
My environment:
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_1.0.0 stringr_0.6.2 data.table_1.9.5 vimcom_1.2-6
loaded via a namespace (and not attached):
[1] chron_2.3-45 grid_3.1.2 lattice_0.20-30 tools_3.1.2 zoo_1.7-11
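R is using the Intel MKL math library.
Based on these results, I think @akrun's second solution (the shift version) is the fastest.
I ran the tests again, this time with data.table recompiled with -O3 and R updated to 3.2.0. The results are quite different: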
  replications elapsed relative user.self sys.self user.child sys.child
1          100   21.22        1     20.73     0.48         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100   11.31        1     10.39     0.92         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100   35.77        1     35.53     0.25         NA        NA
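So the best solution (shift) is even faster under the new R with -O3 (12.70 s down to 11.31 s), but the second-best solution (diff) is much slower (12.91 s up to 21.22 s).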