Adding rows to a data frame to report all the values that did not change over time

Question

I have this data frame:

Votes <- data.frame(
  VoteCreationDate = c(1,3,3,5,5,6),
  GiverId = c(19,19,38,19,38,19),
  CumNumUpVotes = c(1,3,1,7,2,10)
)
Votes


 VoteCreationDate GiverId CumNumUpVotes
                1      19             1
                3      19             3
                3      38             1
                5      19             7
                5      38             2
                6      19            10

For each GiverId (19 and 38), every possible date (the numbers from 1 to 6) should be listed in VoteCreationDate.

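Then, for each combination of GiverId and VoteCreationDate, the corresponding CumNumUpVotes should be matched. If there is no corresponding value, CumNumUpVotes should be taken from the immediately preceding VoteCreationDate of the same GiverId. If there is no preceding date either, it should be 0, as the expected output shows for GiverId = 38 on dates 1 and 2.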

For example, VoteCreationDate = 4 and GiverId = 38 has no corresponding CumNumUpVotes. This cell should equal 1, which is the CumNumUpVotes for GiverId = 38 at VoteCreationDate = 3.

The end result should look like this:

 VoteCreationDate GiverId CumNumUpVotes
                1      19             1
                2      19             1
                3      19             3
                4      19             3
                5      19             7
                6      19            10
                1      38             0
                2      38             0
                3      38             1
                4      38             1
                5      38             2
                6      38             2

Any idea how to do this?

Answer 1

A solution using dplyr and tidyr.

library(dplyr)
library(tidyr)

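Votes2 <- Votes %>%
  # Add a row for every date/giver pair that is missing (CumNumUpVotes = NA)
  complete(VoteCreationDate = full_seq(VoteCreationDate, period = 1), GiverId) %>%
  arrange(GiverId, VoteCreationDate) %>%
  group_by(GiverId) %>%
  # Carry the last known count forward within each giver
  fill(CumNumUpVotes) %>%
  # Dates before a giver's first vote have nothing to carry forward: use 0
  replace_na(list(CumNumUpVotes = 0)) %>%
  ungroup()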
Votes2
# # A tibble: 12 x 3
#    VoteCreationDate GiverId CumNumUpVotes
#               <dbl>   <dbl>         <dbl>
#  1             1.00    19.0          1.00
#  2             2.00    19.0          1.00
#  3             3.00    19.0          3.00
#  4             4.00    19.0          3.00
#  5             5.00    19.0          7.00
#  6             6.00    19.0          10.0 
#  7             1.00    38.0             0   
#  8             2.00    38.0             0   
#  9             3.00    38.0          1.00
# 10             4.00    38.0          1.00
# 11             5.00    38.0          2.00
# 12             6.00    38.0          2.00
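
To see why the order of the steps matters, the pipeline can be split into its three stages. This is only a sketch that rebuilds `Votes` from the question; the names `step1`, `step2` and `step3` are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

Votes <- data.frame(
  VoteCreationDate = c(1, 3, 3, 5, 5, 6),
  GiverId = c(19, 19, 38, 19, 38, 19),
  CumNumUpVotes = c(1, 3, 1, 7, 2, 10)
)

# Stage 1: complete() adds the missing date/giver pairs with NA counts
step1 <- Votes %>%
  complete(VoteCreationDate = full_seq(VoteCreationDate, period = 1), GiverId) %>%
  arrange(GiverId, VoteCreationDate)

# Stage 2: fill() carries the last known count forward within each giver;
# rows before a giver's first vote stay NA
step2 <- step1 %>%
  group_by(GiverId) %>%
  fill(CumNumUpVotes) %>%
  ungroup()

# Stage 3: only the leading NAs are left, and they become 0
step3 <- step2 %>%
  replace_na(list(CumNumUpVotes = 0))

# Compare the count column after each stage
data.frame(
  GiverId = step1$GiverId,
  VoteCreationDate = step1$VoteCreationDate,
  after_complete = step1$CumNumUpVotes,
  after_fill = step2$CumNumUpVotes,
  after_replace_na = step3$CumNumUpVotes
)
```

If the NAs were replaced with 0 before `fill()`, there would be nothing left to carry forward and the gaps (for example GiverId = 38 at date 4) would end up as 0 instead of the previous value.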

Answer 2

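# Handle each giver separately, then stack the pieces back together
do.call(rbind, lapply(split(Votes, Votes$GiverId), function(x){
    # Outer join with the full date range so that missing dates show up as NA rows
    temp = merge(x, data.frame(VoteCreationDate = 1:6), all = TRUE)
    # Copy the giver's id into the newly added rows
    temp$GiverId = temp$GiverId[!is.na(temp$GiverId)][1]
    # Set missing counts to 0; since a cumulative count never decreases,
    # the running maximum then carries the last known value forward
    temp$CumNumUpVotes = cummax(replace(temp$CumNumUpVotes, is.na(temp$CumNumUpVotes), 0))
    temp
}))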
#     VoteCreationDate GiverId CumNumUpVotes
#19.1                1      19             1
#19.2                2      19             1
#19.3                3      19             3
#19.4                4      19             3
#19.5                5      19             7
#19.6                6      19            10
#38.1                1      38             0
#38.2                2      38             0
#38.3                3      38             1
#38.4                4      38             1
#38.5                5      38             2
#38.6                6      38             2
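
The `cummax` trick works here because CumNumUpVotes never decreases, and the date range 1:6 is hard-coded. As a hedged variant in base R, the sketch below derives the date range from the data and uses an explicit forward fill, so it does not depend on the values being non-decreasing. `carry_forward`, `grid` and `filled` are hypothetical names introduced for illustration, not part of the original answer.

```r
Votes <- data.frame(
  VoteCreationDate = c(1, 3, 3, 5, 5, 6),
  GiverId = c(19, 19, 38, 19, 38, 19),
  CumNumUpVotes = c(1, 3, 1, 7, 2, 10)
)

# Hypothetical helper: carry the last non-NA value forward;
# NAs before the first known value become `default`
carry_forward <- function(x, default = 0) {
  idx <- cumsum(!is.na(x))          # number of known values seen so far
  c(default, x[!is.na(x)])[idx + 1] # position 1 holds the default
}

# Every date in the observed range, for every giver
grid <- expand.grid(
  VoteCreationDate = seq(min(Votes$VoteCreationDate), max(Votes$VoteCreationDate)),
  GiverId = unique(Votes$GiverId)
)

# Left join: grid rows without a matching vote get NA in CumNumUpVotes
filled <- merge(grid, Votes, all.x = TRUE)
filled <- filled[order(filled$GiverId, filled$VoteCreationDate), ]

# Forward fill within each giver
filled$CumNumUpVotes <- ave(filled$CumNumUpVotes, filled$GiverId, FUN = carry_forward)
rownames(filled) <- NULL
filled
```

This assumes the dates are whole numbers with a step of 1, as in the question.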
