Complex reshaping of a data frame: tracking record edits
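I have a data frame that tracks when a project record was edited, which field was edited, and the old and new values. The OrderDate, Probability, and Total columns show the current values of those fields: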
df.raw <- data.frame(project=rep(c('A','B'), each=4),
                     createDate=as.Date(rep(c('2015-01-01','2017-05-01'), each=4)),
                     editDate=as.Date(c('2018-06-01','2019-04-01','2019-05-01','2019-06-01', '2018-10-01','2018-11-01','2018-11-15','2019-01-01')),
                     editField=c('OrderDate', 'OrderDate','Probability','Probability', 'Total','Total', 'Probability','Total'),
                     oldValue=c('2018-06-01','2019-05-01',20,30,500,550,70,400),
                     newValue=c('2019-05-01','2019-06-01',30,50,550,400,30,450),
                     OrderDate=as.Date(rep(c('2019-06-01','2019-01-01'), each=4)),
                     Probability=rep(c(50,70), each=4),
                     Total=rep(c(10,450), each=4),
                     # keep oldValue/newValue as character (the default since R 4.0)
                     stringsAsFactors=FALSE)
  project createDate   editDate   editField   oldValue   newValue  OrderDate Probability Total
1       A 2015-01-01 2018-06-01   OrderDate 2018-06-01 2019-05-01 2019-06-01          50    10
2       A 2015-01-01 2019-04-01   OrderDate 2019-05-01 2019-06-01 2019-06-01          50    10
3       A 2015-01-01 2019-05-01 Probability         20         30 2019-06-01          50    10
4       A 2015-01-01 2019-06-01 Probability         30         50 2019-06-01          50    10
5       B 2017-05-01 2018-10-01       Total        500        550 2019-01-01          70   450
6       B 2017-05-01 2018-11-01       Total        550        400 2019-01-01          70   450
7       B 2017-05-01 2018-11-15 Probability         70         30 2019-01-01          70   450
8       B 2017-05-01 2019-01-01       Total        400        450 2019-01-01          70   450
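I want to transform this data frame so that:
- There is an additional row for when the project was created.
- Each row shows the project's order date, probability, and total as of that creation or edit.
- If a field was never edited, it always equals the project's final OrderDate, Probability, or Total value.
The final result would look like this: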
df.reshaped <- data.frame(project=rep(c('A','B'), each=5),
                          editDate=as.Date(c('2015-01-01','2018-06-01','2019-04-01','2019-05-01','2019-06-01', '2017-05-01', '2018-10-01','2018-11-01','2018-11-15','2019-01-01')),
                          editField=c('Created','OrderDate', 'OrderDate','Probability','Probability','Created', 'Total','Total', 'Probability','Total'),
                          OrderDateAtEdit=as.Date(c('2018-06-01','2019-05-01','2019-06-01','2019-06-01','2019-06-01',rep('2019-01-01', 5))),
                          ProbabilityAtEdit=c(20,20,20,30,50,70,70,70,30,30),
                          TotalAtEdit=c(10,10,10,10,10,500,550,400,400,450))
   project   editDate   editField OrderDateAtEdit ProbabilityAtEdit TotalAtEdit
1        A 2015-01-01     Created      2018-06-01                20          10
2        A 2018-06-01   OrderDate      2019-05-01                20          10
3        A 2019-04-01   OrderDate      2019-06-01                20          10
4        A 2019-05-01 Probability      2019-06-01                30          10
5        A 2019-06-01 Probability      2019-06-01                50          10
6        B 2017-05-01     Created      2019-01-01                70         500
7        B 2018-10-01       Total      2019-01-01                70         550
8        B 2018-11-01       Total      2019-01-01                70         400
9        B 2018-11-15 Probability      2019-01-01                30         400
10       B 2019-01-01       Total      2019-01-01                30         450
I don't know where to start, so any help would be appreciated! Thanks.
I think the data have been merged together, and you need to split them into an events table and an edits table:
library(data.table)
setDT(df.raw)
# create the events table first, filled with the current (final) values
cols <- c("OrderDate", "Probability", "Total")
events <- df.raw[, setnames(rbindlist(.(.(createDate[1L], "Created"),
                                        .(editDate, editField))),
                            c("editDate", "editField")), project]
events[unique(df.raw, by=c("project", "Probability", "Total")), on=.(project),
       paste0(cols, "AtEdit") := lapply(mget(cols), as.character)]
# keep the historical edits in another table, one validity interval per value
edits <- df.raw[, .(startDate=c(createDate[1L], editDate),
                    endDate=c(editDate, as.Date("9999-12-31")),
                    value=c(oldValue, newValue[.N])), .(project, editField)]
# non-equi join each field's intervals onto the events and overwrite the defaults
for (x in cols) {
    cn <- paste0(x, "AtEdit")
    v <- edits[editField==x][events, on=.(project, startDate<=editDate, endDate>editDate), value]
    events[, (cn) := fifelse(is.na(v), get(cn), as.character(v))]
}
Output:
    project   editDate   editField OrderDateAtEdit ProbabilityAtEdit TotalAtEdit
 1:       A 2015-01-01     Created      2018-06-01                20          10
 2:       A 2018-06-01   OrderDate      2019-05-01                20          10
 3:       A 2019-04-01   OrderDate      2019-06-01                20          10
 4:       A 2019-05-01 Probability      2019-06-01                30          10
 5:       A 2019-06-01 Probability      2019-06-01                50          10
 6:       B 2017-05-01     Created      2019-01-01                70         500
 7:       B 2018-10-01       Total      2019-01-01                70         550
 8:       B 2018-11-01       Total      2019-01-01                70         400
 9:       B 2018-11-15 Probability      2019-01-01                30         400
10:       B 2019-01-01       Total      2019-01-01                30         450
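To see why the join conditions are startDate<=editDate and endDate>editDate, it may help to look at the interval idea in isolation. The following is only a minimal base R sketch, not part of the answer's code: it rebuilds the validity intervals for one field (Probability of project A, with values copied from df.raw) and looks up the value in force on a given date. The names intervals and value_at are hypothetical and exist only for this illustration.

```r
# Toy data: the two Probability edits of project A, copied from df.raw
createDate <- as.Date("2015-01-01")
editDate   <- as.Date(c("2019-05-01", "2019-06-01"))
oldValue   <- c(20, 30)
newValue   <- c(30, 50)

# Each value is valid from startDate (inclusive) to endDate (exclusive).
# The first old value applies from creation; the last new value has no end,
# so it gets a far-future sentinel date, as in the edits table above.
intervals <- data.frame(
  startDate = c(createDate, editDate),
  endDate   = c(editDate, as.Date("9999-12-31")),
  value     = c(oldValue, newValue[length(newValue)])
)

# value_at() is a hypothetical helper: return the value whose interval contains d
value_at <- function(d) {
  hit <- intervals$startDate <= d & intervals$endDate > d
  intervals$value[hit]
}

value_at(as.Date("2015-01-01"))  # 20, the value at creation
value_at(as.Date("2019-05-01"))  # 30, the new value on the day of the edit
value_at(as.Date("2019-06-01"))  # 50, the final value
```

Making endDate exclusive is what lets an edit row show the new value on its own editDate. In the data.table code above the looked-up values are stored as character, because oldValue and newValue mix dates and numbers; as.Date() or as.numeric() can convert individual AtEdit columns afterwards if typed columns are needed.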