Reshaping the data into a panel by summing up if a certain observation appears more than once?
I have data (df_input) on the year of establishment (year_est) of schools in four different villages (A, B, C, D).
df_input <- data.frame( school_id= c(1,2,3,4,5,6), village= c("A","B","B","C","D","D"), year_est = c(2002,2002,2004,2001,2004,2004))
df_output <- data.frame(year= c(2001,2002,2003,2004,2001,2002,2003,2004,2001,2002,2003,2004,2001,2002,2003,2004),
village = c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
school_est=c(0,1,1,1,0,1,1,2,1,1,1,1,0,0,0,2))
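I am trying to reshape the data into the format of df_output, where the variable "school_est" takes the value 1 once a school has been established in the village, and stays 0 otherwise.
In addition, if more than one school has been established in a given village, school_est can take a value greater than 1, as for village B in 2004 in df_output.
It is also common in my dataset for more than one school to be established in the same village in the same year, as is the case for village D in 2004. So in df_output, school_est goes from 0 in 2003 to 2 in 2004.
Can anyone help me with this?
I am using the following code to try to generate df_output: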
df_panel <- df_input %>%
merge(expand.grid(year=2001:2004, Village=.$Village), by="Village") %>%
mutate(across(year_est, ~ as.numeric(replace_na(.x <= year, 0))))
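One thing that may be worth checking in the attempt above, offered as a guess rather than a confirmed diagnosis: the code refers to a column named Village, while df_input as defined in the question names it village, and R column names are case-sensitive. A quick base R check, using only the df_input from the question (the names has_lower, has_upper and missing_col are made up for this illustration):

```r
# Same input as in the question
df_input <- data.frame(
  school_id = c(1, 2, 3, 4, 5, 6),
  village   = c("A", "B", "B", "C", "D", "D"),
  year_est  = c(2002, 2002, 2004, 2001, 2004, 2004)
)

# Column names are case-sensitive: "village" exists, "Village" does not
has_lower <- "village" %in% names(df_input)
has_upper <- "Village" %in% names(df_input)

# Asking for a non-existent column with $ returns NULL rather than an error
missing_col <- df_input$Village

print(c(village = has_lower, Village = has_upper))
print(is.null(missing_col))
```

Even with the name corrected, .x <= year yields a 0/1 indicator per school and year rather than a count per village and year, so some form of counting is still needed for villages with more than one school.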
We can use complete from tidyr:
library(dplyr)
library(tidyr)
df_input %>%
count(village, year = year_est, name = 'school_est') %>%
complete(village, year = min(year):max(year),
fill = list(school_est = 0)) %>%
mutate(school_est = ave(school_est, village, FUN = cumsum))
-output
# A tibble: 16 x 3
village year school_est
<chr> <dbl> <dbl>
1 A 2001 0
2 A 2002 1
3 A 2003 1
4 A 2004 1
5 B 2001 0
6 B 2002 1
7 B 2003 1
8 B 2004 2
9 C 2001 1
10 C 2002 1
11 C 2003 1
12 C 2004 1
13 D 2001 0
14 D 2002 0
15 D 2003 0
16 D 2004 2
Or using base R
out <- transform(as.data.frame(table(transform(df_input,
year_est = factor(year_est, levels = min(year_est):max(year_est)))[-1])),
Freq = ave(Freq, village, FUN = cumsum))
out[order(out$village),]
village year_est Freq
1 A 2001 0
5 A 2002 1
9 A 2003 1
13 A 2004 1
2 B 2001 0
6 B 2002 1
10 B 2003 1
14 B 2004 2
3 C 2001 1
7 C 2002 1
11 C 2003 1
15 C 2004 1
4 D 2001 0
8 D 2002 0
12 D 2003 0
16 D 2004 2
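As a cross-check on the results above, the target can also be computed straight from its definition: for each village-year pair, count the schools in that village whose year_est is at or before that year. This is only a sketch in base R. The names panel and count_est are made up for the illustration and do not come from the answers above.

```r
# Same input as in the question
df_input <- data.frame(
  school_id = c(1, 2, 3, 4, 5, 6),
  village   = c("A", "B", "B", "C", "D", "D"),
  year_est  = c(2002, 2002, 2004, 2001, 2004, 2004)
)

# Every village-year combination over the observed range of years
# (expand.grid varies the first argument fastest, so rows are ordered
# by village, then year)
panel <- expand.grid(
  year    = min(df_input$year_est):max(df_input$year_est),
  village = sort(unique(df_input$village)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of schools in village v established in or before year y
count_est <- function(v, y) {
  sum(df_input$village == v & df_input$year_est <= y)
}

# Apply the helper row by row
panel$school_est <- mapply(count_est, panel$village, panel$year,
                           USE.NAMES = FALSE)

print(panel)
```

This does one pass over df_input per panel row, so it is slower than the cumsum approaches on large data, but it makes the definition of school_est explicit.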
This took longer than expected
library(tidyverse)
df_input <- data.frame( school_id= c(1,2,3,4,5,6), village= c("A","B","B","C","D","D"), year_est = c(2002,2002,2004,2001,2004,2004))
df_input %>%
group_by(village, year_est) %>%
summarise(school_est = n(), .groups = 'drop') %>%
complete(nesting(village), year_est = seq(min(year_est), max(year_est),1), fill = list(school_est = 0)) %>%
group_by(village) %>%
mutate(school_est = cumsum(school_est)) %>%
ungroup()
#> # A tibble: 16 x 3
#> village year_est school_est
#> <chr> <dbl> <dbl>
#> 1 A 2001 0
#> 2 A 2002 1
#> 3 A 2003 1
#> 4 A 2004 1
#> 5 B 2001 0
#> 6 B 2002 1
#> 7 B 2003 1
#> 8 B 2004 2
#> 9 C 2001 1
#> 10 C 2002 1
#> 11 C 2003 1
#> 12 C 2004 1
#> 13 D 2001 0
#> 14 D 2002 0
#> 15 D 2003 0
#> 16 D 2004 2
Created on 2021-06-28 by the reprex package (v2.0.0)