Is there a way to complete or expand an interval factor variable?
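I have a data frame/tibble that includes a factor variable of bins. Some bins are missing because the original data had no observations in those 5-year ranges. Is there a way to easily complete the series without having to deconstruct the intervals?
Here is a sample df.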
library(tibble)
df <- structure(list(bin = structure(c(1L, 3L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("[1940,1945]",
"(1945,1950]", "(1950,1955]", "(1955,1960]", "(1960,1965]", "(1965,1970]",
"(1970,1975]", "(1975,1980]", "(1980,1985]", "(1985,1990]", "(1990,1995]",
"(1995,2000]", "(2000,2005]", "(2005,2010]", "(2010,2015]", "(2015,2020]",
"(2020,2025]"), class = "factor"), Values = c(2L, 4L, 14L, 11L,
8L, 26L, 30L, 87L, 107L, 290L, 526L, 299L, 166L, 502L, 8L)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
df
# A tibble: 15 x 2
bin Values
<fct> <int>
1 [1940,1945] 2
2 (1950,1955] 4
3 (1960,1965] 14
4 (1965,1970] 11
5 (1970,1975] 8
6 (1975,1980] 26
7 (1980,1985] 30
8 (1985,1990] 87
9 (1990,1995] 107
10 (1995,2000] 290
11 (2000,2005] 526
12 (2005,2010] 299
13 (2010,2015] 166
14 (2015,2020] 502
15 (2020,2025] 8
I would like to add the missing (1945,1950] and (1955,1960] bins.
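library(dplyr)    # %>%, mutate, group_by, summarize, left_join
library(ggplot2)  # cut_width
library(tidyr)    # replace_na

# Bin the original years into 5-year intervals.
df <- orig_df %>%
  mutate(bin = cut_width(Year, width = 5, center = 2.5))
# Count observations per bin; bins with no observations are dropped here.
df2 <- df %>%
  group_by(bin) %>%
  summarize(Values = n()) %>%
  ungroup()
# Join the counts back onto the full set of levels and fill the gaps with 0.
tibble(bin = levels(df$bin)) %>%
  left_join(df2, by = "bin") %>%
  replace_na(list(Values = 0))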
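The bins go missing at the summarize step because grouping drops unused factor levels by default. Here is a minimal sketch of keeping them at counting time instead, assuming a dplyr version (0.8 or later) where count() accepts .drop = FALSE. orig_df and its Year column are not shown in the question, so the small data frame below is made up purely for illustration, and base cut() stands in for cut_width():

```r
library(dplyr)

# Made-up stand-in for orig_df: no years fall in (1945,1950].
orig_small <- tibble(Year = c(1941, 1943, 1952, 1953, 1954))

# Bin into 5-year intervals; dig.lab = 4 keeps the labels as plain years.
binned <- orig_small %>%
  mutate(bin = cut(Year, breaks = seq(1940, 1955, by = 5),
                   include.lowest = TRUE, dig.lab = 4))

# .drop = FALSE keeps factor levels with no observations, counted as 0.
counts <- binned %>% count(bin, name = "Values", .drop = FALSE)
counts
```

If this fits your data, the join and replace_na steps are no longer needed, and bin stays a factor in its level order.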
The bin column already has the levels you want, so you can use complete on df as follows:
tidyr::complete(df, bin = levels(bin), fill = list(Values = 0))
# A tibble: 17 x 2
# bin Values
# <chr> <dbl>
# 1 (1945,1950] 0
# 2 (1950,1955] 4
# 3 (1955,1960] 0
# 4 (1960,1965] 14
# 5 (1965,1970] 11
# 6 (1970,1975] 8
# 7 (1975,1980] 26
# 8 (1980,1985] 30
# 9 (1985,1990] 87
#10 (1990,1995] 107
#11 (1995,2000] 290
#12 (2000,2005] 526
#13 (2005,2010] 299
#14 (2010,2015] 166
#15 (2015,2020] 502
#16 (2020,2025] 8
#17 [1940,1945] 2
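Because levels(bin) is a character vector, the result above has a character bin column sorted alphabetically, which is why [1940,1945] ends up last. Here is a minimal sketch that keeps bin as a factor in its original level order, relying on the documented behaviour that complete() expands a factor column over all of its levels, not just the observed ones. The shortened df_small is a hypothetical stand-in for the sample data:

```r
library(tidyr)

# Hypothetical, shortened stand-in for the sample df: four levels, two observed.
lv <- c("[1940,1945]", "(1945,1950]", "(1950,1955]", "(1955,1960]")
df_small <- tibble::tibble(
  bin    = factor(c("[1940,1945]", "(1950,1955]"), levels = lv),
  Values = c(2L, 4L)
)

# Passing the factor column itself makes complete() expand over every level,
# including unobserved ones; 0L keeps Values as an integer column.
res <- complete(df_small, bin, fill = list(Values = 0L))
res
```

The same call on the full sample would be complete(df, bin, fill = list(Values = 0L)).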