How to expand data before reshape in R
I have a data frame that looks like this:
as.is <- data.frame(Project = c('Proj A', 'Proj B', 'Proj C', 'Proj D'),
Start.Date = c('16.02.2015', '02.03.2015', '16.02.2015', '09.03.2015'),
Duration = c(3, 2, 2, 4),
No.Of.Resources = c(3, 5, 2, 6))
I need to change the format so that it looks like this:
to.be <- data.frame(Project = c('Proj A', 'Proj B', 'Proj C', 'Proj D'),
'16.02.2015' = c(3, NA, 2, NA),
'23.02.2015' = c(3, NA, 2, NA),
'02.03.2015' = c(3, 5, NA, NA),
'09.03.2015' = c(NA, 5, NA, 6),
'16.03.2015' = c(NA, NA, NA, 6),
'23.03.2015' = c(NA, NA, NA, 6),
'30.03.2015' = c(NA, NA, NA, 6),
check.names = FALSE)  # keep the date strings as column names instead of X16.02.2015 etc.
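I can't figure out how to expand the dates so that I have one per row, which would let me use reshape2 on the data. I can get a list of the dates that I want as my column headers, but I can't see how to put the pieces together.
What is the right way to approach this?
Edit: To clarify, Duration is a number of weeks, so I need to generate columns headed x, x+7, x+14, ...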
I suggest using the tidyr package instead of reshape2. Although tidyr imports reshape2 for some operations, I believe it should be considered its successor.
# Convert to Date class to sort the columns correctly
as.is$Start.Date <- as.Date(as.character(as.is$Start.Date), "%d.%m.%Y")
intermediate <- with(as.is, data.frame(
Project = rep(Project, Duration),
Date = rep(Start.Date, Duration) +
7*(unlist(lapply(Duration, seq_len))-1),
No.Of.Resources = rep(No.Of.Resources, Duration)
))
library(tidyr)
result <- spread(intermediate, Date, No.Of.Resources)
Looking at the result, you get:
Project 2015-02-16 2015-02-23 2015-03-02 2015-03-09 2015-03-16 2015-03-23
1 Proj A 3 3 3 NA NA NA
2 Proj B NA NA 5 5 NA NA
3 Proj C 2 2 NA NA NA NA
4 Proj D NA NA NA 6 6 6
2015-03-30
1 NA
2 NA
3 NA
4 6
Calling dput(result) on it gives the result you asked for:
structure(list(
Project = structure(1:4, .Label = c("Proj A", "Proj B", "Proj C", "Proj D"), class = "factor"),
`2015-02-16` = c(3, NA, 2, NA),
`2015-02-23` = c(3, NA, 2, NA),
`2015-03-02` = c(3, 5, NA, NA),
`2015-03-09` = c(NA, 5, NA, 6),
`2015-03-16` = c(NA, NA, NA, 6),
`2015-03-23` = c(NA, NA, NA, 6),
`2015-03-30` = c(NA, NA, NA, 6)),
.Names = c("Project", "2015-02-16", "2015-02-23", "2015-03-02", "2015-03-09", "2015-03-16", "2015-03-23", "2015-03-30"),
class = "data.frame", row.names = c(NA, 4L))
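To see what the rep() and seq_len() calls in the intermediate step are doing, here is a small base R sketch using the Duration values from the question. The variable names project, week_index and offset_days are made up for illustration and do not appear in the answer above.

```r
# Durations (in weeks) for Proj A..D, as in the question
Duration <- c(3, 2, 2, 4)

# One entry per project-week: each project name repeated Duration times
project <- rep(c("Proj A", "Proj B", "Proj C", "Proj D"), Duration)

# Running week number within each project: 1,2,3, 1,2, 1,2, 1,2,3,4
week_index <- unlist(lapply(Duration, seq_len))

# Days to add to each project's start date: 0, 7, 14, ...
offset_days <- 7 * (week_index - 1)

print(data.frame(project, week_index, offset_days))
```

Base R's sequence(Duration) should give the same running index as unlist(lapply(Duration, seq_len)), if you prefer something shorter.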
Here is an approach that seems to work. It uses expandRows and getanID from my "splitstackshape" package, and then dcast.data.table from "data.table" to spread the values into wide form:
as.is$Start.Date <- as.Date(as.character(as.is$Start.Date), "%d.%m.%Y")
library(splitstackshape)
dcast.data.table(
getanID(
expandRows(as.is, "Duration"),
c("Project", "Start.Date"))[
, Start.Date := Start.Date + (.id-1) * 7],
Project ~ Start.Date, value.var = "No.Of.Resources")
# Project 2015-02-16 2015-02-23 2015-03-02 2015-03-09 2015-03-16 2015-03-23 2015-03-30
# 1: Proj A 3 3 3 NA NA NA NA
# 2: Proj B NA NA 5 5 NA NA NA
# 3: Proj C 2 2 NA NA NA NA NA
# 4: Proj D NA NA NA 6 6 6 6
In this case, "dplyr" really does help make the solution more readable:
library(splitstackshape)
library(dplyr)
library(tidyr)
as.is$Start.Date <- as.Date(as.character(as.is$Start.Date), "%d.%m.%Y")
expandRows(as.is, "Duration") %>% # expand the data
getanID(c("Project", "Start.Date")) %>% # add an "id" column
mutate(Start.Date = Start.Date + (.id-1) * 7) %>% # recalculate start dates
select(-.id) %>% # drop the "id" column
spread(Start.Date, No.Of.Resources) # reshape long to wide
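If you want to follow the expand / add id / shift date / reshape steps without installing any packages, the same idea can be sketched in base R. This is only an illustration of the logic, not how splitstackshape implements it; the names long, Week and wide are my own, and tapply() returns a matrix rather than a data frame.

```r
as.is <- data.frame(
  Project = c("Proj A", "Proj B", "Proj C", "Proj D"),
  Start.Date = as.Date(c("16.02.2015", "02.03.2015", "16.02.2015", "09.03.2015"),
                       format = "%d.%m.%Y"),
  Duration = c(3, 2, 2, 4),
  No.Of.Resources = c(3, 5, 2, 6)
)

# Expand: repeat each row index Duration times (one row per project-week)
long <- as.is[rep(seq_len(nrow(as.is)), as.is$Duration), ]

# Add a running id within each project: 1, 2, ..., Duration
long$.id <- ave(seq_len(nrow(long)), long$Project, FUN = seq_along)

# Shift each row's date by (.id - 1) weeks
long$Week <- long$Start.Date + 7 * (long$.id - 1)

# Reshape long to wide; project/week combinations with no rows come out as NA
wide <- tapply(long$No.Of.Resources,
               list(long$Project, format(long$Week)),
               sum)
print(wide)
```

Because format() on a Date gives ISO strings (yyyy-mm-dd), the columns sort chronologically.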
I would do this differently in data.table. Updated with a new solution:
library(data.table)
dt = as.data.table(as.is)
dt[, Start.Date := as.Date(Start.Date, '%d.%m.%Y')]
# use dcast.data.table before version 1.9.5
dcast(dt[, list(seq(Start.Date, length.out = Duration, by = '1 week'), No.Of.Resources)
, by = Project], Project ~ V1)
Old (unnecessarily complicated) solution:
# expand out Start.Date by Project
dates.all = dt[, seq(Start.Date, length.out = Duration, by = '1 week'), by = Project]
# set the key and do a rolling join, then dcast
# (can use just dcast in version 1.9.5+, have to use dcast.data.table before that)
setkey(dt, Project, Start.Date)
dcast(dt[dates.all, roll = TRUE], Project ~ Start.Date)
# Project 2015-02-16 2015-02-23 2015-03-02 2015-03-09 2015-03-16 2015-03-23 2015-03-30
#1: Proj A 3 3 3 NA NA NA NA
#2: Proj B NA NA 5 5 NA NA NA
#3: Proj C 2 2 NA NA NA NA NA
#4: Proj D NA NA NA 6 6 6 6
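Both data.table versions rely on seq() for Date objects with by = '1 week'. As a minimal base R sketch of just that piece, using two of the projects from the question (the names starts, durations and weeks are assumptions for illustration):

```r
# Start dates and durations for Proj A and Proj B
starts <- as.Date(c("16.02.2015", "02.03.2015"), format = "%d.%m.%Y")
durations <- c(3, 2)

# One weekly date sequence per project, each of length Duration
weeks <- lapply(seq_along(starts), function(i) {
  seq(starts[i], length.out = durations[i], by = "1 week")
})

print(weeks)
```

Inside dt[, ..., by = Project] the same seq() call runs once per project, which is why no explicit loop is needed there.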