Combine Multiple Columns Into Tidy Data
My dataset looks like this:
unique.id abx.1 start.1 stop.1 abx.2 start.2 stop.2 abx.3 start.3 stop.3 abx.4 start.4
1 1 Moxi 2014-01-01 2014-01-07 PenG 2014-01-01 2014-01-07 Vanco 2014-01-01 2014-01-07 Moxi 2014-01-01
2 2 Moxi 2014-01-01 2014-01-02 Cipro 2014-01-01 2014-01-02 PenG 2014-01-01 2014-01-02 Vanco 2014-01-01
3 3 Cipro 2014-01-01 2014-01-05 Vanco 2014-01-01 2014-01-05 Cipro 2014-01-01 2014-01-05 Vanco 2014-01-01
4 4 Vanco 2014-01-02 2014-01-03 Cipro 2014-01-02 2014-01-03 Cipro 2014-01-02 2014-01-03 PenG 2014-01-02
5 5 Vanco 2014-01-01 2014-01-02 PenG 2014-01-01 2014-01-02 PenG 2014-01-01 2014-01-02 Cipro 2014-01-01
stop.4 intervention
1 2014-01-07 0
2 2014-01-02 0
3 2014-01-05 1
4 2014-01-03 1
5 2014-01-02 0
Here is some code to create it:
abxoptions <- c("Cipro", "Moxi", "PenG", "Vanco")
df3 <- data.frame(
  unique.id = 1:5,
  abx.1 = sample(abxoptions, 5, replace = TRUE),
  start.1 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
  stop.1 = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
  abx.2 = sample(abxoptions, 5, replace = TRUE),
  start.2 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
  stop.2 = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
  abx.3 = sample(abxoptions, 5, replace = TRUE),
  start.3 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
  stop.3 = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
  abx.4 = sample(abxoptions, 5, replace = TRUE),
  start.4 = as.Date(c('2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-01')),
  stop.4 = as.Date(c('2014-01-07', '2014-01-02', '2014-01-05', '2014-01-03', '2014-01-02')),
  intervention = c(0, 0, 1, 1, 0)
)
I would like to tidy this data so that it looks like this:
unique.id abx start stop intervention
1 Moxi 2014-01-01 2014-01-07 0
1 PenG 2014-01-01 2014-01-07 0
1 Vanco 2014-01-01 2014-01-07 0
1 Moxi 2014-01-01 2014-01-07 0
etc.
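The following solutions do not meet my needs:
Gather multiple sets of columns and
I suspect Hadley's amazing tidyr package is the way to go... I just can't figure it out. Any help would be much appreciated.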
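You could try reshape from base R:
# varying stops before the last column (intervention), whose name has no ".1" style suffix
reshape(df3, direction='long', idvar='unique.id', varying=2:(ncol(df3) - 1), sep=".")
Or use merged.stack from splitstackshape:
library(splitstackshape)
# select the date columns by name so the conversion does not depend on column order
merged.stack(df3, var.stubs=c('abx', 'start', 'stop'), sep='.')[,
  c('start', 'stop') := lapply(.SD, as.Date,
  origin='1970-01-01'), .SDcols=c('start', 'stop')][]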
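To see what the base R route produces, here is a minimal sketch on a toy data frame. The object and column names (wide, id, flag) are made up for illustration and are not part of the question's data; the point is only that reshape() splits names such as abx.1 on the separator and keeps the Date class.

```r
# Toy wide data (hypothetical names): two repeated groups plus one constant column
wide <- data.frame(
  id      = 1:2,
  abx.1   = c("Moxi", "Cipro"),
  start.1 = as.Date(c("2014-01-01", "2014-01-02")),
  abx.2   = c("PenG", "Vanco"),
  start.2 = as.Date(c("2014-01-03", "2014-01-04")),
  flag    = c(0, 1),
  stringsAsFactors = FALSE
)

# Only the columns whose names contain a "." are time-varying
varying_cols <- grep(".", names(wide), fixed = TRUE)

long <- reshape(wide, direction = "long", idvar = "id",
                varying = varying_cols, sep = ".")

# Sort by id and time, and drop the generated row names for readability
long <- long[order(long$id, long$time), ]
rownames(long) <- NULL
long
```

Selecting the varying columns by pattern rather than by position avoids accidentally including a constant column such as flag.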
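Almost every data tidying problem can be solved in three steps:
- Gather all non-variable columns
- Separate the "colname" column into multiple variables
- Re-spread the data
(Often you will only need one or two of these, but I think they almost always come in this order.)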
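As a rough illustration of the three steps, here is a sketch on a toy data frame where every measured value is numeric, so no type restoration is needed. The data and the names wide, step1, step2 and step3 are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: two measurements (height, weight) at two visits
wide <- data.frame(
  id       = 1:2,
  height.1 = c(150, 160),
  weight.1 = c(50, 60),
  height.2 = c(151, 161),
  weight.2 = c(52, 62)
)

# Step 1: gather every non-variable column into key/value pairs
step1 <- gather(wide, col, value, -id)

# Step 2: separate the old column names into a variable name and a visit number
step2 <- separate(step1, col, c("variable", "number"))

# Step 3: spread the variable names back out into columns
step3 <- spread(step2, variable, value)
step3
```

Printing step1 and step2 as well shows how the data moves from wide, to fully long, to one row per id and visit.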
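For your data:
- The only columns that are already variables are unique.id and intervention
- You need to split the current column names into a variable and a number
- Then you need to put the "variable" variable back into the columns
This looks like: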
library(tidyr)
library(dplyr)
df3 %>%
  gather(col, value, -unique.id, -intervention) %>%
  separate(col, c("variable", "number")) %>%
  spread(variable, value, convert = TRUE) %>%
  mutate(start = as.Date(start, origin = "1970-01-01"),
         stop = as.Date(stop, origin = "1970-01-01"))
Your case is a bit more complicated because you have two types of variables, so you need to restore the types at the end.
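If a newer tidyr is available (this assumes version 1.0.0 or later, which is not what the answer above was written against), pivot_longer() with the special ".value" name can do the gather, separate and spread in one call and keeps each column's type. A sketch on a cut-down, hypothetical two-row version of the data:

```r
library(tidyr)

# Hypothetical two-row, two-course version of the question's data
wide <- data.frame(
  unique.id    = 1:2,
  abx.1        = c("Moxi", "Cipro"),
  start.1      = as.Date(c("2014-01-01", "2014-01-02")),
  stop.1       = as.Date(c("2014-01-07", "2014-01-03")),
  abx.2        = c("PenG", "Vanco"),
  start.2      = as.Date(c("2014-01-01", "2014-01-02")),
  stop.2       = as.Date(c("2014-01-07", "2014-01-03")),
  intervention = c(0, 1),
  stringsAsFactors = FALSE
)

# ".value" sends the part of each name before the "." to its own output column;
# the part after the "." goes into the "number" column
long <- pivot_longer(
  wide,
  cols      = -c(unique.id, intervention),
  names_to  = c(".value", "number"),
  names_sep = "\\."
)
long
```

Because the dates never pass through a shared value column, start and stop stay as Date and no as.Date() fix-up is needed.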
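Recently, a new feature was added to melt.data.table that makes it easy to melt into multiple columns. All you have to do is supply the groups of columns you want to melt as separate elements of a list in the measure.vars argument, or select them with patterns().
Follow these instructions to get the development version.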
library(data.table) ## v1.9.5+
setDT(df3) # df3 is now a data.table
# id = 1L keeps only unique.id; columns that are neither id nor measure are dropped
melt(df3, id = 1L, measure = patterns("^abx", "^start", "^stop"),
     value.name = c("abx", "start", "stop"))
# unique.id variable abx start stop
# 1: 1 1 Moxi 2014-01-01 2014-01-07
# 2: 2 1 Moxi 2014-01-01 2014-01-02
# 3: 3 1 Cipro 2014-01-01 2014-01-05
# 4: 4 1 Vanco 2014-01-02 2014-01-03
# 5: 5 1 Vanco 2014-01-01 2014-01-02
# 6: 1 2 PenG 2014-01-01 2014-01-07
# 7: 2 2 Cipro 2014-01-01 2014-01-02
# 8: 3 2 Vanco 2014-01-01 2014-01-05
# 9: 4 2 Cipro 2014-01-02 2014-01-03
# 10: 5 2 PenG 2014-01-01 2014-01-02
# 11: 1 3 Vanco 2014-01-01 2014-01-07
# 12: 2 3 PenG 2014-01-01 2014-01-02
# 13: 3 3 Cipro 2014-01-01 2014-01-05
# 14: 4 3 Cipro 2014-01-02 2014-01-03
# 15: 5 3 PenG 2014-01-01 2014-01-02
# 16: 1 4 Moxi 2014-01-01 2014-01-07
# 17: 2 4 Vanco 2014-01-01 2014-01-02
# 18: 3 4 Vanco 2014-01-01 2014-01-05
# 19: 4 4 PenG 2014-01-02 2014-01-03
# 20: 5 4 Cipro 2014-01-01 2014-01-02
I used a column number for id and patterns for measure here, but you can also supply column names.
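For instance, a sketch of the same kind of melt with explicit column names, on a small made-up table (the object name dt and its contents are hypothetical). Listing intervention in id.vars carries it through to the result:

```r
library(data.table)

# Hypothetical cut-down table with two antibiotic courses per patient
dt <- data.table(
  unique.id    = 1:2,
  abx.1        = c("Moxi", "Cipro"),
  start.1      = as.Date(c("2014-01-01", "2014-01-02")),
  abx.2        = c("PenG", "Vanco"),
  start.2      = as.Date(c("2014-01-03", "2014-01-04")),
  intervention = c(0, 1)
)

# Each list element is one group of columns that is melted into a single output column
long <- melt(
  dt,
  id.vars      = c("unique.id", "intervention"),
  measure.vars = list(c("abx.1", "abx.2"), c("start.1", "start.2")),
  value.name   = c("abx", "start")
)
long
```

The variable column in the result records which position in each group a row came from (1 for the first course, 2 for the second).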