R: replace TRUE values in logical columns with the column names using data.table syntax
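I have a dataset that contains some logical columns, and I want to replace the 'TRUE' values with the name of the column they sit in. I asked a similar question earlier and, with suggestions from other S/O users, found a workable solution. However, that solution does not use data.table syntax and copies the whole dataset instead of replacing by reference, which is slow.
What is the most appropriate way to do this with data.table syntax?
I tried this: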
# Load library
library(data.table)
# Create dummy data.table:
mydt <- data.table(id = c(1,2,3,4,5),
                   ptname = c("jack", "jill", "jo", "frankie", "claire"),
                   sex = c("m", "f", "f", "m", "f"),
                   apple = c(T,F,F,T,T),
                   orange = c(F,T,F,T,F),
                   pear = c(T,T,T,T,F))
# View dummy data:
> mydt
id ptname sex apple orange pear
1: 1 jack m TRUE FALSE TRUE
2: 2 jill f FALSE TRUE TRUE
3: 3 jo f FALSE FALSE TRUE
4: 4 frankie m TRUE TRUE TRUE
5: 5 claire f TRUE FALSE FALSE
# Function to recode values in a data.table:
recode.multi <- function(datacol, oldval, newval) {
  trans <- setNames(newval, oldval)
  trans[ match(datacol, names(trans)) ]
}
# Get a list of all the logical columns in the data set:
logicalcols <- names(which(mydt[, sapply(mydt, is.logical)] == TRUE))
# Apply the function to convert 'TRUE' to the relevant column names:
mydt[, (logicalcols) := lapply(.SD, recode.multi,
oldval = c(FALSE, TRUE),
newval = c("FALSE", names(.SD))), .SDcols = logicalcols]
# View the result:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple FALSE apple
2: 2 jill f FALSE apple apple
3: 3 jo f FALSE FALSE apple
4: 4 frankie m apple apple apple
5: 5 claire f apple FALSE FALSE
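This is incorrect: it does not step through the column names to get each replacement value, it just reuses the first one ("apple" in this case).
Moreover, if I reverse the order of the old and new values, the function ignores my string replacement for the second value and uses the first two column names as the replacements in every column: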
# Apply the function with order of old and new values reversed:
mydt[, (logicalcols) := lapply(.SD, recode.multi,
oldval = c(TRUE, FALSE),
newval = c(names(.SD), "FALSE")), .SDcols = logicalcols]
# View the result:
> mydt
id ptname sex apple orange pear
1: 1 jack m apple orange apple
2: 2 jill f orange apple apple
3: 3 jo f orange orange apple
4: 4 frankie m apple apple apple
5: 5 claire f apple orange orange
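I'm sure I'm missing something simple, but does anyone know why the function does not iterate over the column names (and how to edit it so that it does)?
My expected output is as follows: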
> mydt
id ptname sex apple orange pear
1: 1 jack m apple FALSE pear
2: 2 jill f FALSE orange pear
3: 3 jo f FALSE FALSE pear
4: 4 frankie m apple orange pear
5: 5 claire f apple FALSE FALSE
Alternatively, any other suggestions for concise data.table syntax would be much appreciated.
We can use the melt/dcast approach:
dcast(melt(mydt, id.var = c("id", "ptname", "sex"))[,
value1 := as.character(value)][(value), value1 := variable],
id + ptname + sex~variable, value.var = "value1")
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
Another option is set, which should be more efficient:
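# Integer positions of the logical columns; is.logical() gives exactly one
# value per column, so the positions stay aligned with the table's columns
nm1 <- which(sapply(mydt, is.logical))
for(j in nm1){
  i1 <- which(mydt[[j]])                                    # rows that are TRUE
  set(mydt, i = NULL, j = j, value = as.character(mydt[[j]]))  # logical -> character
  set(mydt, i = i1, j = j, value = names(mydt)[j])          # TRUE rows get the column name
}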
mydt
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
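Since the question is specifically about avoiding a copy of the whole table, a small check like the one below can help confirm that set updates in place. This is only a sketch on a made-up three-row table (dt and its columns are illustrative, not the asker's data), and it assumes that data.table's address() is a fair indicator of whether the table object itself was reallocated.

```r
library(data.table)

# Illustrative table, not the data from the question
dt <- data.table(id = 1:3, apple = c(TRUE, FALSE, TRUE))
before <- address(dt)               # address of the table before the update

j  <- which(names(dt) == "apple")   # position of the logical column
i1 <- which(dt[[j]])                # rows that are TRUE
set(dt, j = j, value = as.character(dt[[j]]))  # swap the column for a character version
set(dt, i = i1, j = j, value = names(dt)[j])   # overwrite the TRUE rows with the name

# Same address before and after suggests the table itself was not copied
identical(before, address(dt))
```

Only the affected column is replaced; the other columns are left untouched, which is where the saving over a full copy comes from.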
Another option, mentioned in the comments, is:
mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
mydt
# id ptname sex apple orange pear
#1: 1 jack m apple FALSE pear
#2: 2 jill f FALSE orange pear
#3: 3 jo f FALSE FALSE pear
#4: 4 frankie m apple orange pear
#5: 5 claire f apple FALSE FALSE
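As for why the attempt in the question did not step through the column names: lapply passes its extra arguments unchanged to every call, so names(.SD) is evaluated once to the full vector of names and each column receives that same vector. setNames then pairs only the first two elements of newval with the two old values, which is why "apple" (or "apple" and "orange") shows up everywhere. Map walks the columns and the names in parallel instead. A minimal sketch, using a hypothetical helper label_true and a made-up table:

```r
library(data.table)

# Illustrative table, not the data from the question
dt <- data.table(id = 1:3,
                 apple = c(TRUE, FALSE, TRUE),
                 pear  = c(FALSE, FALSE, TRUE))
cols <- names(dt)[sapply(dt, is.logical)]

# Hypothetical one-column helper: TRUE becomes `label`, FALSE becomes "FALSE"
label_true <- function(x, label) replace(as.character(x), which(x), label)

# Map() pairs the i-th column of .SD with the i-th element of cols,
# so each column is recoded with its own name
dt[, (cols) := Map(label_true, .SD, cols), .SDcols = cols]
dt
```

Using which(x) rather than x as the index means any NA values are left as NA instead of being treated as TRUE.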
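Update: comparing options two and three (option one was not feasible because of the number of non-logical columns) on a dataset with 18573 rows and 650 columns, of which 252 are logical, the run times were as follows: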
# Option 2:
nm1 <- which(unlist(mydt[, lapply(.SD, is.logical)]))
system.time(
for(j in nm1){
i1 <- which(mydt[[j]])
set(mydt, i=NULL, j=j, value = as.character(mydt[[j]]))
set(mydt, i = i1, j=j, value = names(mydt)[j])
}
)
# user system elapsed
# 0.61 0.00 0.61
# Option 3:
system.time(
mydt[, (nm1) := Map(function(x,y) replace(x, x, y), .SD, names(mydt)[nm1]), .SDcols = nm1]
)
#user system elapsed
#0.65 0.00 0.66
Both are much faster than the original approach, which does not use data.table syntax:
# Original approach:
logitrue <- which(mydt == TRUE, arr.ind = T)
system.time(
mydt[logitrue, ] <- colnames(mydt)[logitrue[,2]]
)
# user system elapsed
# 1.22 0.03 4.22
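The timings above come from the asker's own data, which is not available here. To rerun the comparison of options two and three, something along these lines could be used on synthetic data. The sizes, the seed and the names synth, dt_set and dt_map are all assumptions made for illustration, and the resulting timings will differ from those reported above.

```r
library(data.table)

set.seed(1)            # arbitrary seed so the synthetic data is reproducible
n_rows    <- 2000L     # assumed size, much smaller than the real dataset
n_logical <- 50L       # assumed number of logical columns

# Build a table with one id column and n_logical random logical columns
flags <- replicate(n_logical,
                   sample(c(TRUE, FALSE), n_rows, replace = TRUE),
                   simplify = FALSE)
names(flags) <- paste0("flag", seq_len(n_logical))
synth <- as.data.table(c(list(id = seq_len(n_rows)), flags))

nm1 <- which(sapply(synth, is.logical))

# Both options modify their input in place, so each gets its own copy
dt_set <- copy(synth)
dt_map <- copy(synth)

t_set <- system.time(
  for (j in nm1) {
    i1 <- which(dt_set[[j]])
    set(dt_set, i = NULL, j = j, value = as.character(dt_set[[j]]))
    set(dt_set, i = i1, j = j, value = names(dt_set)[j])
  }
)

t_map <- system.time(
  dt_map[, (nm1) := Map(function(x, y) replace(x, x, y), .SD, names(dt_map)[nm1]),
         .SDcols = nm1]
)

rbind(set = t_set, Map = t_map)[, 1:3]  # user, system, elapsed for each option
all.equal(dt_set, dt_map)               # the two results should agree
```

Working on copies matters here: without copy(), the second option would run on columns that the first had already converted to character.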