R long to wide format factor levels as binary variables and dates
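I want to convert from long format to wide format and use the factor levels as binary variables. That is, if a factor level occurs at least once for a patient, the corresponding variable should be 1; otherwise it should be 0. In addition, I want the dates as variable values in columns date.1, date.2, ...
What I have is the following: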
data_sample <- data.frame(
PatID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
status = c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")
)
What I want is:
PatID COPD Cardio Cerebro date.COPD.1 date.COPD.2 date.Cardio.1 date.Cardio.2 date.Cerebro.1
1 1 0 0 2016-12-14 2017-02-04 NA NA NA
2 0 1 0 NA NA 2012-03-27 NA NA
3 1 1 1 2012-04-21 NA 2010-02-03 2011-03-05 2014-08-25
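It takes a few steps, but this should give you the desired output.
Note, however, that there seems to be a typo in your input data: I assume you meant "COPD" rather than "CPOD", since that is what your expected output suggests.
The first step is to turn the string "NA" into an explicit missing value, i.e. NA.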
data_sample[data_sample == "NA"] <- NA
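The sample data also contains a date, "2012-27-03", that appears to be in year-day-month order (the expected output shows 2012-03-27). A minimal sketch of one way to normalise such values before reshaping is shown here. It assumes that every date which fails to parse as year-month-day is really year-day-month; the helper name `parse_mixed_date` is hypothetical and not part of the original answer.

```r
# Hypothetical helper: parse as "%Y-%m-%d" first, then fall back to
# "%Y-%d-%m" for entries that could not be parsed.
# Assumption: unparseable entries have day and month swapped.
parse_mixed_date <- function(x) {
  ymd <- as.Date(x, format = "%Y-%m-%d")
  ydm <- as.Date(x, format = "%Y-%d-%m")
  ymd[is.na(ymd)] <- ydm[is.na(ymd)]
  ymd
}

dates <- c("2016-12-14", NA, "2012-27-03")
parsed <- parse_mixed_date(dates)
format(parsed)
# [1] "2016-12-14" NA           "2012-03-27"
```

If that assumption holds for your data, something like `data_sample$date <- parse_mixed_date(data_sample$date)` could be run at this point; the rest of the answer works on the character dates as given.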
Now reshape with data.table::dcast.
library(data.table)
setDT(data_sample)
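# running counter per patient and status, used to number the date columns
data_sample[, id := rowid(status), by = PatID]
# 0/1 indicator: does the status occur at least once for the patient?
dt1 <- dcast(data_sample[!is.na(date)], PatID ~ status, fun.aggregate = function(x) +any(x))
# one date column per status and occurrence
dt2 <- dcast(data_sample[!is.na(date)], PatID ~ paste0("date_", status) + id, value.var = "date")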
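To see what the id column contributes, here is a small sketch of `rowid()` on a toy vector (the vector is made up for illustration). It numbers repeated values in order of appearance, which is what lets the second `dcast` call produce one date column per occurrence of a status.

```r
library(data.table)

# Toy status vector for a single patient (illustrative only)
status <- c("COPD", "COPD", "Cardio", "COPD")

# Each value gets a counter of how many times it has appeared so far
ids <- rowid(status)
ids
# [1] 1 2 1 3
```

In the answer this is computed with `by = PatID`, so the counter restarts for every patient.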
Finally, join the two data.tables:
out <- dt1[dt2, on = 'PatID']
out
# PatID Cardio Cerebro COPD date_COPD_1 date_COPD_2 date_Cardio_1 date_Cardio_2 date_Cerebro_1
#1: 1 0 0 1 2016-12-14 2017-02-04 <NA> <NA> <NA>
#2: 2 1 0 0 <NA> <NA> 2012-27-03 <NA> <NA>
#3: 3 1 1 1 2012-04-21 <NA> 2010-02-03 2011-03-05 2014-08-25
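As a cross-check on the binary columns, the same 0/1 indicators can be sketched in base R with `table()`. This is an alternative illustration rather than part of the answer above, and the vectors below simply retype the corrected sample with the missing rows dropped.

```r
# Corrected sample without the rows where status is missing
PatID  <- c(1L, 1L, 2L, 3L, 3L, 3L, 3L)
status <- c("COPD", "COPD", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro")

# Count occurrences per patient and status, then collapse counts to 0/1
counts <- table(PatID, status)
indicator <- +(counts > 0)
indicator
```

The column order of the printed table depends on the locale's sort order, so compare the values by column name rather than by position.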
Data
data_sample <- data.frame(
PatID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
status =c("COPD", "COPD", "NA", "NA", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro"))