R: retain the most recent row per id and impute NAs
I have a dataset in which each row represents one student's response and each column represents a teacher-evaluation question.
StudentId Q1 Q2 Q3 Q4 SystemTime
1 NA 5 2 NA 09:01:07.2123
2 1 4 4 NA 09:03:01.3145
2 NA 4 4 1 09:03:02.6145
3 1 3 NA 2 09:47:17.6541
3 1 NA NA 5 10:01:17.2343
3 3 NA 1 NA 10:12:01.3435
4 NA NA 1 2 12:07:13.1187
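My goal is to 1) keep the most recent response from each student, which I am currently doing with:
library(dplyr)
library(lubridate)
df %>%
  group_by(StudentId) %>%
  # use the bare column name so which.max() is evaluated within each group
  slice(which.max(period_to_seconds(hms(SystemTime))))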
StudentId Q1 Q2 Q3 Q4 SystemTime
1 NA 5 2 NA 09:01:07.2123
2 NA 4 4 1 09:03:02.6145
3 3 NA 1 NA 10:12:01.3435
4 NA NA 1 2 12:07:13.1187
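I would also like to impute the missing values in the most recent response from that student's (StudentId) earlier responses. The expected final result is shown below.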
StudentId Q1 Q2 Q3 Q4 SystemTime
1 NA 5 2 NA 09:01:07.2123
2 1 4 4 1 09:03:02.6145
3 3 3 1 5 10:12:01.3435
4 NA NA 1 2 12:07:13.1187
Any suggestions are much appreciated.
First fill the NA values within each group, then select the row with the latest time.
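library(dplyr)
library(tidyr)
df %>%
  group_by(StudentId) %>%
  # fill() carries values downward, so rows must already be in time order per student
  fill(starts_with('Q')) %>%
  # %OS keeps the fractional seconds when parsing the time
  slice(which.max(as.POSIXct(SystemTime, format = '%H:%M:%OS')))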
# StudentId Q1 Q2 Q3 Q4 SystemTime
# <int> <int> <int> <int> <int> <chr>
#1 1 NA 5 2 NA 09:01:07.2123
#2 2 1 4 4 1 09:03:02.6145
#3 3 3 3 1 5 10:12:01.3435
#4 4 NA NA 1 2 12:07:13.1187
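One caveat for the approach above: `fill()` carries values downward in row order, so the result depends on each student's rows already being sorted chronologically. A minimal sketch of a version that sorts explicitly first is shown here. It assumes `SystemTime` is a zero-padded HH:MM:SS string, so sorting it as text is chronological; the shuffled data frame `df_shuffled` is made up for illustration from the question's values and is not part of the original answer.

```r
library(dplyr)
library(tidyr)

# The question's rows, deliberately shuffled (an assumption for illustration)
# to show that the input order no longer matters.
df_shuffled <- data.frame(
  StudentId  = c(3L, 2L, 1L, 3L, 4L, 2L, 3L),
  Q1         = c(3L, NA, NA, 1L, NA, 1L, 1L),
  Q2         = c(NA, 4L, 5L, 3L, NA, 4L, NA),
  Q3         = c(1L, 4L, 2L, NA, 1L, 4L, NA),
  Q4         = c(NA, 1L, NA, 2L, 2L, NA, 5L),
  SystemTime = c("10:12:01.3435", "09:03:02.6145", "09:01:07.2123",
                 "09:47:17.6541", "12:07:13.1187", "09:03:01.3145",
                 "10:01:17.2343"),
  stringsAsFactors = FALSE
)

result <- df_shuffled %>%
  arrange(StudentId, SystemTime) %>%                 # oldest to newest within each student
  group_by(StudentId) %>%
  fill(starts_with("Q"), .direction = "down") %>%    # carry earlier answers forward
  slice(n()) %>%                                     # last row = most recent response
  ungroup()

print(result)
```

Because the rows are sorted before filling, the last row of each group is both the most recent response and the one that has inherited every earlier non-missing answer.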
Data
df <- structure(list(StudentId = c(1L, 2L, 2L, 3L, 3L, 3L, 4L), Q1 = c(NA,
1L, NA, 1L, 1L, 3L, NA), Q2 = c(5L, 4L, 4L, 3L, NA, NA, NA),
Q3 = c(2L, 4L, 4L, NA, NA, 1L, 1L), Q4 = c(NA, NA, 1L, 2L,
5L, NA, 2L), SystemTime = c("09:01:07.2123", "09:03:01.3145",
"09:03:02.6145", "09:47:17.6541", "10:01:17.2343", "10:12:01.3435",
"12:07:13.1187")), class = "data.frame", row.names = c(NA, -7L))
This answer makes no assumptions about the column names.
library(readr)
library(dplyr)
df <- read_csv("StudentId,Q1,Q2,Q3,Q4,SystemTime
1,,5,2,,09:01:07.2123
2,1,4,4,,09:03:01.3145
2,,4,4,1,09:03:02.6145
3,1,3,,2,09:47:17.6541
3,1,,,5,10:01:17.2343
3,3,,1,,10:12:01.3435
4,,,1,2,12:07:13.1187")
# A tibble: 7 x 6
StudentId Q1 Q2 Q3 Q4 SystemTime
<dbl> <dbl> <dbl> <dbl> <dbl> <time>
1 1 NA 5 2 NA 09:01:07
2 2 1 4 4 NA 09:03:01
3 2 NA 4 4 1 09:03:02
4 3 1 3 NA 2 09:47:17
5 3 1 NA NA 5 10:01:17
6 3 3 NA 1 NA 10:12:01
7 4 NA NA 1 2 12:07:13
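Using group_by:
df %>% group_by(StudentId) %>%
  arrange(SystemTime) %>%              # oldest first, so last() picks the newest value
  summarise_all(~ last(na.omit(.)))    # last non-NA value of every column, per student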
# A tibble: 4 x 6
StudentId Q1 Q2 Q3 Q4 SystemTime
<dbl> <dbl> <dbl> <dbl> <dbl> <time>
1 1 NA 5 2 NA 09:01:07
2 2 1 4 4 1 09:03:02
3 3 3 3 1 5 10:12:01
4 4 NA NA 1 2 12:07:13
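The idea behind this answer, taking the last non-missing value of every column for each student, does not depend on dplyr. As a rough base R sketch of the same logic (not part of the original answer): `last_non_na` is a hypothetical helper name, the data frame is retyped from the question, and `SystemTime` is assumed to be a zero-padded string so that text order is time order.

```r
# The question's data, retyped as a plain data frame.
df <- data.frame(
  StudentId  = c(1L, 2L, 2L, 3L, 3L, 3L, 4L),
  Q1         = c(NA, 1L, NA, 1L, 1L, 3L, NA),
  Q2         = c(5L, 4L, 4L, 3L, NA, NA, NA),
  Q3         = c(2L, 4L, 4L, NA, NA, 1L, 1L),
  Q4         = c(NA, NA, 1L, 2L, 5L, NA, 2L),
  SystemTime = c("09:01:07.2123", "09:03:01.3145", "09:03:02.6145",
                 "09:47:17.6541", "10:01:17.2343", "10:12:01.3435",
                 "12:07:13.1187"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing value of x, or an NA of the same
# type as x when every value is missing.
last_non_na <- function(x) {
  kept <- x[!is.na(x)]
  if (length(kept) == 0) x[NA_integer_] else kept[length(kept)]
}

# Sort chronologically within each student, then collapse each student's
# rows to a single row, column by column.
ordered <- df[order(df$StudentId, df$SystemTime), ]
pieces <- lapply(split(ordered, ordered$StudentId), function(g) {
  as.data.frame(lapply(g, last_non_na), stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL

print(result)
```

Returning a typed NA from the helper keeps every column's type stable when the per-student rows are bound back together.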