data.table: Create new character column based on indicator columns values and names
I have a data.table with 1.6x10^8 records, and I want to create a new character column from the names of the indicator columns whose value is 1.
For example,
library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b"),
drugA=c(1,1,1,0,0),
drugB=c(0,1,1,1,0),
drugC=c(0,0,1,0,1))
ID drugA drugB drugC
1: a 1 0 0
2: a 1 1 0
3: a 1 1 1
4: b 0 1 0
5: b 0 0 1
### NOTE: I know the paste0(..., collapse) argument might be helpful in concatenating the drug names as an intermediate step
Desired output:
ID drugA drugB drugC exposure
1: a 1 0 0 drugA
2: a 1 1 0 drugA+drugB
3: a 1 1 1 drugA+drugB+drugC
4: b 0 1 0 drugB
5: b 0 0 1 drugC
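I would like this to be as robust and clean as possible, relying entirely on data.table syntax and/or other useful packages/functions (e.g. dcast); I want to avoid writing extensive user-defined functions because, given the size of my data.table, they would take a long time to run.
I have looked at other posts but could not find anything similar to my situation and desired output.
Any help would be greatly appreciated.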
We can group by the sequence of rows, specify the 'drug' columns in .SDcols, convert the Subset of Data.table (.SD) to logical, use it to subset the column names, and paste them together:
library(data.table)
DT[, exposure := paste(names(.SD)[as.logical(.SD)], collapse= '+'),
1:nrow(DT), .SDcols = drugA:drugC]
DT
# ID drugA drugB drugC exposure
#1: a 1 0 0 drugA
#2: a 1 1 0 drugA+drugB
#3: a 1 1 1 drugA+drugB+drugC
#4: b 0 1 0 drugB
#5: b 0 0 1 drugC
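To see what the grouped expression does for one row, here is a minimal sketch that reproduces a single group by hand. The list `row` is a hypothetical stand-in for the one-row `.SD` that data.table would supply for row 2; it is an illustration only, not part of the answer's code.

```r
# Hypothetical stand-in for .SD in row 2 of DT (one value per drug column)
row <- list(drugA = 1, drugB = 1, drugC = 0)

# as.logical() on a list of length-1 values gives one TRUE/FALSE per column
keep <- as.logical(row)

# Keep the names of the columns flagged TRUE and join them with "+"
exposure <- paste(names(row)[keep], collapse = "+")
exposure
```

Because the expression is evaluated once per row, this form may be slow on a table of the size described in the question.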
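Alternatively, without grouping by row, we can loop over the columns, change the values to the column names, paste them together with do.call, and remove the NA elements with gsub: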
DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste,
c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD,
seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]
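The `(NA^!x) * y` step is compact, so here is a small sketch that unpacks it for one column. The vectors `x` and `nms` are hypothetical stand-ins for the drugB column and for `names(.SD)`; the position 2 plays the role of `y`.

```r
# Hypothetical stand-ins: the drugB indicator (column 2) and the drug column names
x <- c(0, 1, 1, 1, 0)
nms <- c("drugA", "drugB", "drugC")

# !x is TRUE where x is 0; NA^TRUE is NA and NA^FALSE is 1
mask <- NA^!x

# Multiplying by the column position gives an index that is NA where x is 0
idx <- mask * 2

# Indexing with NA returns NA; otherwise we get the column name
labels <- nms[idx]
labels

# paste() turns NA into the text "NA", which the gsub() pattern then strips
cleaned <- gsub("NA\\+|\\+NA", "", "NA+drugB+NA")
cleaned

# Caveat: a row with no exposure at all is left as the text "NA"
none <- gsub("NA\\+|\\+NA", "", "NA+NA+NA")
none
```

If rows with no exposure can occur, or if a column name itself contains "NA", the cleanup pattern would likely need adjusting.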
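Another option is to reshape to long format with melt, label each row as exposure or no_exposure, and cast back with dcast:
library('data.table')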
DT[, id := .I]
df <- melt(DT, id.vars = 'id', measure.vars = c("drugA", "drugB", "drugC"))
df[value == 1, expose := 'exposure']
df[value == 0, expose := 'no_exposure'][, value := NULL]
df <- dcast(df, id ~ expose, fun.aggregate = function(x) paste0(x, collapse = "+"), value.var = 'variable')
DT[df, on = 'id'][, id := NULL][]
# ID drugA drugB drugC exposure no_exposure
# 1: a 1 0 0 drugA drugB+drugC
# 2: a 1 1 0 drugA+drugB drugC
# 3: a 1 1 1 drugA+drugB+drugC
# 4: b 0 1 0 drugB drugA+drugC
# 5: b 0 0 1 drugC drugA+drugB