R: retain the most recent row per id and impute NAs

I have a dataset in which each row is one student's response and each column is a teacher-evaluation question.

   StudentId     Q1    Q2   Q3   Q4    SystemTime
   1             NA    5    2    NA    09:01:07.2123
   2             1     4    4    NA    09:03:01.3145
   2             NA    4    4    1     09:03:02.6145
   3             1     3    NA   2     09:47:17.6541
   3             1     NA   NA   5     10:01:17.2343
   3             3     NA   1    NA    10:12:01.3435
   4             NA    NA   1    2     12:07:13.1187

My goal is 1) to keep only the most recent response from each student, which I am doing with

library(dplyr)
library(lubridate)  # hms() is assumed to come from lubridate

df %>% 
  group_by(StudentId) %>%
  slice(which.max(period_to_seconds(hms(SystemTime))))

   StudentId     Q1    Q2   Q3   Q4    SystemTime
   1             NA    5    2    NA    09:01:07.2123
   2             NA    4    4    1     09:03:02.6145
   3             3     NA   1    NA    10:12:01.3435
   4             NA    NA   1    2     12:07:13.1187

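2) I also want to impute the missing values in the most recent response from that student's (StudentId) earlier responses. The expected final result is shown below.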

   StudentId     Q1    Q2   Q3   Q4    SystemTime
   1             NA    5    2    NA    09:01:07.2123
   2             1     4    4    1     09:03:02.6145
   3             3     3    1    5     10:12:01.3435
   4             NA    NA   1    2     12:07:13.1187

Any suggestions are much appreciated.

First fill the NA values within each group, then select the row with the latest time.

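library(dplyr)
library(tidyr)

df %>% 
  group_by(StudentId) %>%
  # carry each student's earlier answers forward into later rows
  fill(starts_with('Q')) %>%
  # keep the latest row; %OS also parses the fractional seconds
  slice(which.max(as.POSIXct(SystemTime, format = '%H:%M:%OS')))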


#  StudentId    Q1    Q2    Q3    Q4 SystemTime   
#      <int> <int> <int> <int> <int> <chr>        
#1         1    NA     5     2    NA 09:01:07.2123
#2         2     1     4     4     1 09:03:02.6145
#3         3     3     3     1     5 10:12:01.3435
#4         4    NA    NA     1     2 12:07:13.1187
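
Note that fill() carries values downward in row order, so this relies on each student's rows already being sorted by time. A minimal sketch, assuming the rows could arrive in arbitrary order (the shuffle and the names `shuffled` and `latest` are only for illustration), sorts first and then takes the last row per student with slice_tail():

```r
library(dplyr)
library(tidyr)

# Sample data from the question.
df <- data.frame(
  StudentId  = c(1L, 2L, 2L, 3L, 3L, 3L, 4L),
  Q1         = c(NA, 1L, NA, 1L, 1L, 3L, NA),
  Q2         = c(5L, 4L, 4L, 3L, NA, NA, NA),
  Q3         = c(2L, 4L, 4L, NA, NA, 1L, 1L),
  Q4         = c(NA, NA, 1L, 2L, 5L, NA, 2L),
  SystemTime = c("09:01:07.2123", "09:03:01.3145", "09:03:02.6145",
                 "09:47:17.6541", "10:01:17.2343", "10:12:01.3435",
                 "12:07:13.1187"),
  stringsAsFactors = FALSE
)

# Deliberately shuffled rows (illustrative assumption, not part of the question).
shuffled <- df[c(6, 3, 1, 7, 4, 2, 5), ]

latest <- shuffled %>%
  # Zero-padded HH:MM:SS strings sort chronologically as plain text.
  arrange(StudentId, SystemTime) %>%
  group_by(StudentId) %>%
  # Fill each question downward, i.e. from earlier to later responses.
  fill(starts_with("Q"), .direction = "down") %>%
  # After sorting, the last row in each group is the most recent one.
  slice_tail(n = 1) %>%
  ungroup()

print(latest)
```

Sorting on the raw string only works here because every timestamp has the same zero-padded layout; with mixed formats the column would need to be parsed first.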

Data

df <- structure(list(StudentId = c(1L, 2L, 2L, 3L, 3L, 3L, 4L), Q1 = c(NA, 
1L, NA, 1L, 1L, 3L, NA), Q2 = c(5L, 4L, 4L, 3L, NA, NA, NA), 
Q3 = c(2L, 4L, 4L, NA, NA, 1L, 1L), Q4 = c(NA, NA, 1L, 2L, 
5L, NA, 2L), SystemTime = c("09:01:07.2123", "09:03:01.3145", 
"09:03:02.6145", "09:47:17.6541", "10:01:17.2343", "10:12:01.3435", 
"12:07:13.1187")), class = "data.frame", row.names = c(NA, -7L))

This answer makes no assumptions about the column names.

library(readr)
library(dplyr)

df <- read_csv("StudentId,Q1,Q2,Q3,Q4,SystemTime
1,,5,2,,09:01:07.2123
2,1,4,4,,09:03:01.3145
2,,4,4,1,09:03:02.6145
3,1,3,,2,09:47:17.6541
3,1,,,5,10:01:17.2343
3,3,,1,,10:12:01.3435
4,,,1,2,12:07:13.1187")


# A tibble: 7 x 6
  StudentId    Q1    Q2    Q3    Q4 SystemTime
      <dbl> <dbl> <dbl> <dbl> <dbl> <time>    
1         1    NA     5     2    NA 09:01:07  
2         2     1     4     4    NA 09:03:01  
3         2    NA     4     4     1 09:03:02  
4         3     1     3    NA     2 09:47:17  
5         3     1    NA    NA     5 10:01:17  
6         3     3    NA     1    NA 10:12:01  
7         4    NA    NA     1     2 12:07:13  

Using group_by, sort by time and keep the last non-missing value in each column:

df %>% group_by(StudentId) %>% 
  arrange(SystemTime) %>%
  summarise_all(~ last(na.omit(.)))


# A tibble: 4 x 6
  StudentId    Q1    Q2    Q3    Q4 SystemTime
      <dbl> <dbl> <dbl> <dbl> <dbl> <time>    
1         1    NA     5     2    NA 09:01:07  
2         2     1     4     4     1 09:03:02  
3         3     3     3     1     5 10:12:01  
4         4    NA    NA     1     2 12:07:13
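
In dplyr 1.0.0 and later, summarise_all() is superseded by across(). A minimal sketch of the same idea in that style follows; the helper `last_non_na()` and the name `collapsed` are hypothetical, introduced here only for illustration, and the data is rebuilt as a plain data frame so the block runs on its own:

```r
library(dplyr)

# Sample data from the question, with SystemTime kept as character.
df <- data.frame(
  StudentId  = c(1L, 2L, 2L, 3L, 3L, 3L, 4L),
  Q1         = c(NA, 1L, NA, 1L, 1L, 3L, NA),
  Q2         = c(5L, 4L, 4L, 3L, NA, NA, NA),
  Q3         = c(2L, 4L, 4L, NA, NA, 1L, 1L),
  Q4         = c(NA, NA, 1L, 2L, 5L, NA, 2L),
  SystemTime = c("09:01:07.2123", "09:03:01.3145", "09:03:02.6145",
                 "09:47:17.6541", "10:01:17.2343", "10:12:01.3435",
                 "12:07:13.1187"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing value of x, or a typed NA if none exists.
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) x[NA_integer_] else keep[length(keep)]
}

collapsed <- df %>%
  # Zero-padded HH:MM:SS strings sort chronologically as plain text.
  arrange(StudentId, SystemTime) %>%
  group_by(StudentId) %>%
  # everything() covers all non-grouping columns, whatever they are called.
  summarise(across(everything(), last_non_na), .groups = "drop")

print(collapsed)
```

Because it takes the last non-missing value per column, this gives the same rows as filling downward and then keeping each student's latest row.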