Missing Values and Rows
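I apologize if this is a duplicate question; I couldn't find anything similar.
I have some data that I'm cleaning, and I need to fill in the missing values. The data looks like this, with the dput below. The decimals are dropped in the printout but are included in the dput.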
> print(tbl_df(df), n=26)
# A tibble: 26 x 6
   Year  Trial  Group1  Group2 Group3  Group4
   <chr> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
 1 Year1     2 346588. 156266  34806.     NA
 2 Year1     3 342573      NA  34652. 292001.
 3 Year1     5 286285. 129257. 29645. 252786.
 4 Year1     7 234410.     NA  24536.     NA
 5 Year1     9 184733.  82944.    NA  170653
 6 Year1    10     NA   81419. 19461  167273.
 7 Year1    11 169620.  74688. 18065  155442
 8 Year1    14 107652   48381. 11941. 100076
 9 Year1    15  88440   39807  10123.  83137
10 Year1    17     NA   31608   7926   64551.
11 Year1    18  63622   29236   7444.  58848.
12 Year1    22  14143.   6366.  1683.  10889.
13 Year2    22 279904  102271  28221. 138804.
14 Year2    25 200386   78628. 21942      NA
15 Year2    26 157182.     NA  18099.  91963.
16 Year2    28 121122.  54538  14532.  76422
17 Year2    30  25899.  16773    489.     NA
18 Year2    32 112091.  51219. 11298.  71655.
19 Year2    33 108756   49311. 10589.  70167
20 Year2    34     NA   49127.    NA   69195.
21 Year2    36 104827   42651.  8568.  63580.
22 Year2    38  44849   14114   2302.  11652
23 Year2    40 104407.  42545   6240   63318.
24 Year2    41  99059.  38423   6766.  58017
25 Year2    42     NA   40432.    NA   57932.
26 Year2    44  49119.   8796.  4769.  11233.
dput(df)
structure(list(Year = c("Year1", "Year1", "Year1", "Year1", "Year1",
"Year1", "Year1", "Year1", "Year1", "Year1", "Year1", "Year1",
"Year2", "Year2", "Year2", "Year2", "Year2", "Year2", "Year2",
"Year2", "Year2", "Year2", "Year2", "Year2", "Year2", "Year2"
), Trial = c(2, 3, 5, 7, 9, 10, 11, 14, 15, 17, 18, 22, 22, 25,
26, 28, 30, 32, 33, 34, 36, 38, 40, 41, 42, 44), Group1 = c(346587.6667,
342573, 286285.3333, 234409.6667, 184733.3333, NA, 169620.3333,
107652, 88440, NA, 63622, 14143.33333, 279904, 200386, 157182.3333,
121122.3333, 25899.33333, 112090.6667, 108756, NA, 104827, 44849,
104407.3333, 99058.66667, NA, 49119.33333), Group2 = c(156266,
NA, 129257.3333, NA, 82943.66667, 81419.33333, 74688.33333, 48381.33333,
39807, 31608, 29236, 6365.666667, 102271, 78628.33333, NA, 54538,
16773, 51218.66667, 49311.33333, 49127.33333, 42650.66667, 14114,
42545, 38423, 40432.33333, 8795.666667), Group3 = c(34805.66667,
34651.66667, 29644.66667, 24535.66667, NA, 19461, 18065, 11941.33333,
10123.33333, 7926, 7444.333333, 1683.333333, 28221.33333, 21942,
18099.33333, 14532.33333, 489.3333333, 11297.66667, 10588.66667,
NA, 8567.666667, 2302.333333, 6240, 6765.666667, NA, 4769.333333
), Group4 = c(NA, 292000.6667, 252785.6667, NA, 170653, 167273.3333,
155442, 100076, 83137, 64551.33333, 58847.66667, 10888.66667,
138803.6667, NA, 91963.33333, 76422, NA, 71655.33333, 70167,
69195.33333, 63579.66667, 11652, 63317.66667, 58017, 57932.33333,
11232.66667)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -26L), spec = structure(list(cols = list(
Year = structure(list(), class = c("collector_character",
"collector")), Trial = structure(list(), class = c("collector_double",
"collector")), Group1 = structure(list(), class = c("collector_double",
"collector")), Group2 = structure(list(), class = c("collector_double",
"collector")), Group3 = structure(list(), class = c("collector_double",
"collector")), Group4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
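Basically, I need to fill the NA values with the value from the previous trial (the trials are in descending order), which in this table is the next row down. For example, I need to fill row 6, column 3 with the value from row 7, column 3.
But that's not all. I also need to create a row for each day with a missing trial, and then fill those rows with the last trial. This is the part I'm hung up on. Is there a way to do both at once?
For example, I need to change tail(df) from A to B.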
A.
  Year  Trial  Group1 Group2 Group3 Group4
  <chr> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 Year2    40 104407. 42545   6240  63318.
2 Year2    41  99059. 38423   6766. 58017
3 Year2    42     NA  40432.    NA  57932.
4 Year2    44  49119.  8796.  4769. 11233.
B.
  Year  Trial  Group1 Group2 Group3 Group4
  <chr> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 Year2    40 104407. 42545   6240  63318.
2 Year2    41  99059. 38423   6766. 58017
3 Year2    42  49119. 40432.  4769. 57932.
4 Year2    43  49119. 40432.  4769. 57932.
5 Year2    44  49119.  8796.  4769. 11233.
You can use complete and fill with .direction = 'up':
library(dplyr)
library(tidyr)

df %>%
  group_by(Year) %>%                                 # work within each year
  complete(Trial = min(Trial):max(Trial)) %>%        # add a row for every missing trial
  fill(starts_with('Group'), .direction = 'up') %>%  # fill NAs from the next trial below
  ungroup()
# A tibble: 44 x 6
#    Year  Trial  Group1  Group2 Group3  Group4
#    <chr> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
#  1 Year1     2 346588. 156266  34806. 292001.
#  2 Year1     3 342573  129257. 34652. 292001.
#  3 Year1     4 286285. 129257. 29645. 252786.
#  4 Year1     5 286285. 129257. 29645. 252786.
#  5 Year1     6 234410.  82944. 24536. 170653
#  6 Year1     7 234410.  82944. 24536. 170653
#  7 Year1     8 184733.  82944. 19461  170653
#  8 Year1     9 184733.  82944. 19461  170653
#  9 Year1    10 169620.  81419. 19461  167273.
# 10 Year1    11 169620.  74688. 18065  155442
# … with 34 more rows
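To see what this does on the rows from the question's A/B example, here is a small sketch on just the last four rows (values rounded; the names tail_df, up_only and up_then_down are made up for illustration). Note that with .direction = 'up' alone, the new Trial 43 row takes its values from Trial 44, whereas table B in the question shows Trial 43 repeating the filled Trial 42 row. If B is the exact target, one possible variant, which is an assumption about the intent and not part of the approach above, is to fill the original NAs upward first, then add the missing trials and carry the last trial downward.

```r
library(dplyr)
library(tidyr)

# Last four rows of the question's data (values rounded for brevity).
tail_df <- tibble(
  Year   = "Year2",
  Trial  = c(40, 41, 42, 44),
  Group1 = c(104407, 99059, NA, 49119),
  Group2 = c(42545, 38423, 40432, 8796),
  Group3 = c(6240, 6766, NA, 4769),
  Group4 = c(63318, 58017, 57932, 11233)
)

# Same pipeline as above: add the missing trials, then fill everything upward.
up_only <- tail_df %>%
  group_by(Year) %>%
  complete(Trial = min(Trial):max(Trial)) %>%
  fill(starts_with("Group"), .direction = "up") %>%
  ungroup()

# Hypothetical variant aimed at table B: fill the original NAs upward first,
# then add the missing trials and carry the last trial's values downward.
up_then_down <- tail_df %>%
  group_by(Year) %>%
  fill(starts_with("Group"), .direction = "up") %>%
  complete(Trial = min(Trial):max(Trial)) %>%
  fill(starts_with("Group"), .direction = "down") %>%
  ungroup()

# Compare the inserted Trial 43 row under each approach.
filter(up_only, Trial == 43)
filter(up_then_down, Trial == 43)
```

The two results differ only in the inserted Trial 43 row; the originally missing Group1 and Group3 values at Trial 42 are filled from Trial 44 either way.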