Adding rows to a data frame to report all the values that did not change over time
I have this data frame:
Votes <- data.frame(
VoteCreationDate = c(1,3,3,5,5,6),
GiverId = c(19,19,38,19,38,19),
CumNumUpVotes = c(1,3,1,7,2,10)
)
Votes
VoteCreationDate GiverId CumNumUpVotes
1 19 1
3 19 3
3 38 1
5 19 7
5 38 2
6 19 10
For each GiverId (19 and 38), every possible date (the numbers 1 through 6) should be listed in VoteCreationDate.
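Then, for each combination of GiverId and VoteCreationDate, the corresponding CumNumUpVotes should be matched. If there is no corresponding value, CumNumUpVotes should be taken from the immediately preceding VoteCreationDate for that GiverId.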
For example, VoteCreationDate = 4 and GiverId = 38 has no corresponding CumNumUpVotes. This cell should equal 1, which is the CumNumUpVotes for GiverId = 38 at VoteCreationDate = 3.
The end result should look like this:
VoteCreationDate GiverId CumNumUpVotes
1 19 1
2 19 1
3 19 3
4 19 3
5 19 7
6 19 10
1 38 0
2 38 0
3 38 1
4 38 1
5 38 2
6 38 2
Any idea how to do this?
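library(dplyr)
library(tidyr)
Votes2 <- Votes %>%
  # Add a row for every date in the full 1..6 sequence, for every GiverId
  complete(VoteCreationDate = full_seq(VoteCreationDate, period = 1), GiverId) %>%
  arrange(GiverId, VoteCreationDate) %>%
  group_by(GiverId) %>%
  # Carry the last known CumNumUpVotes forward within each GiverId
  fill(CumNumUpVotes) %>%
  # Dates before a giver's first vote have nothing to carry forward, so use 0
  replace_na(list(CumNumUpVotes = 0)) %>%
  ungroup()
Votes2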
# # A tibble: 12 x 3
# VoteCreationDate GiverId CumNumUpVotes
# <dbl> <dbl> <dbl>
# 1 1.00 19.0 1.00
# 2 2.00 19.0 1.00
# 3 3.00 19.0 3.00
# 4 4.00 19.0 3.00
# 5 5.00 19.0 7.00
# 6 6.00 19.0 10.0
# 7 1.00 38.0 0
# 8 2.00 38.0 0
# 9 3.00 38.0 1.00
# 10 4.00 38.0 1.00
# 11 5.00 38.0 2.00
# 12 6.00 38.0 2.00
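To make it easier to see what each stage of that pipeline contributes, here is a minimal sketch that splits it into three intermediate objects. The names `expanded`, `filled`, and `result` are illustrative and do not appear in the answer above; the explicit `.direction = "down"` is simply the default of `fill()` spelled out.

```r
library(dplyr)
library(tidyr)

Votes <- data.frame(
  VoteCreationDate = c(1, 3, 3, 5, 5, 6),
  GiverId = c(19, 19, 38, 19, 38, 19),
  CumNumUpVotes = c(1, 3, 1, 7, 2, 10)
)

# Step 1: complete() adds the missing date/giver combinations as NA rows
expanded <- Votes %>%
  complete(VoteCreationDate = full_seq(VoteCreationDate, period = 1), GiverId) %>%
  arrange(GiverId, VoteCreationDate)

# Step 2: fill() carries the last observed value down within each giver;
# rows before a giver's first vote stay NA because nothing precedes them
filled <- expanded %>%
  group_by(GiverId) %>%
  fill(CumNumUpVotes, .direction = "down") %>%
  ungroup()

# Step 3: the remaining leading NAs become 0
result <- filled %>%
  replace_na(list(CumNumUpVotes = 0))

result
```

Printing `expanded` and `filled` in turn shows which rows were created by `complete()` and which values were carried forward, which can help when adapting the pipeline to other data.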
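# Base R: process each GiverId separately, then stack the pieces
do.call(rbind, lapply(split(Votes, Votes$GiverId), function(x){
  # Outer-join against the full date range (hard-coded here as 1:6)
  temp = merge(x, data.frame(VoteCreationDate = 1:6), all = TRUE)
  # Fill in the GiverId on the newly added rows
  temp$GiverId = temp$GiverId[!is.na(temp$GiverId)][1]
  # Replace NA with 0, then take the running maximum; this carries values
  # forward only because cumulative counts never decrease
  temp$CumNumUpVotes = cummax(replace(temp$CumNumUpVotes, is.na(temp$CumNumUpVotes), 0))
  temp
}))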
# VoteCreationDate GiverId CumNumUpVotes
#19.1 1 19 1
#19.2 2 19 1
#19.3 3 19 3
#19.4 4 19 3
#19.5 5 19 7
#19.6 6 19 10
#38.1 1 38 0
#38.2 2 38 0
#38.3 3 38 1
#38.4 4 38 1
#38.5 5 38 2
#38.6 6 38 2
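The base R approach relies on two things that may not hold for other data: the date range is typed in as `1:6`, and `cummax()` only behaves like "carry the previous value forward" when the values never decrease. A hedged variant that derives the range from the data and carries values forward explicitly might look like this. The helper `locf_zero` and the names `all_dates` and `res` are hypothetical, introduced only for this sketch.

```r
Votes <- data.frame(
  VoteCreationDate = c(1, 3, 3, 5, 5, 6),
  GiverId = c(19, 19, 38, 19, 38, 19),
  CumNumUpVotes = c(1, 3, 1, 7, 2, 10)
)

# Derive the full date range from the data instead of hard-coding it
all_dates <- seq(min(Votes$VoteCreationDate), max(Votes$VoteCreationDate))

# Hypothetical helper: carry the last non-NA value forward,
# using 0 for positions before the first observed value
locf_zero <- function(x) {
  seen <- !is.na(x)
  idx <- cumsum(seen)        # how many observed values so far (0 if none)
  c(0, x[seen])[idx + 1]     # position 1 of the lookup vector is the 0 default
}

res <- do.call(rbind, lapply(split(Votes, Votes$GiverId), function(x) {
  # Outer-join this giver's rows against every date in the range
  temp <- merge(x[c("VoteCreationDate", "CumNumUpVotes")],
                data.frame(VoteCreationDate = all_dates), all = TRUE)
  temp$GiverId <- x$GiverId[1]
  temp$CumNumUpVotes <- locf_zero(temp$CumNumUpVotes)
  temp[c("VoteCreationDate", "GiverId", "CumNumUpVotes")]
}))

res
```

On the example data this gives the same twelve rows as the `cummax()` version; the difference would only show up if a value could go down from one date to the next.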