Replacing NAs with prior year value for specific country in panel data
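I merged two data frames, call them A and B. A has values for important variables in every year, with some missing data that I will deal with separately. B only has values for specific years (election years). This is cross-national panel data with country-year as the unit of observation, so it is important to distinguish country and year in any operation. After the merge, the columns from B are NA in non-election years, as expected. Those NAs need to be filled with the data from that country's most recent election, up until that country's next election. I don't want to fill in any NAs in the columns that come from A.
(For those with theoretical concerns: the data in B is about the party in government, so filling it in this way is theoretically justified.)
If I subset the data by country, I can do this easily with the tidyr::fill function by selecting only the columns that contain data from B. I can't do this on the full data frame with all countries, because in some cases it fills a country's first years with values from the previous country in the data frame.
Here is a minimal example of how the data are arranged (keep in mind that the real data have 190 countries and 9282 observations):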
country <- c("Austria","Austria","Austria","Austria","Austria",
"Belgium","Belgium","Belgium","Belgium","Belgium")
year <- c("1999","2000","2001","2002","2003",
"1999","2000","2001","2002","2003")
a1 <- c(5,4,NA,4,3,6,2,9,NA,7)
a2 <- c(45,53,57,51,33,37,12,48,55,41)
b1 <- c(NA,"A",NA,NA,NA,NA,NA,"B",NA,"C")
b2 <- c(NA,7,NA,NA,NA,NA,NA,5,NA,7)
df <- data.frame(country,year,a1,a2,b1,b2)
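country | year | a1 | a2 | b1 | b2 |
---|---|---|---|---|---|
Austria | 1999 | 5 | 45 | NA | NA |
Austria | 2000 | 4 | 53 | A | 7 |
Austria | 2001 | NA | 57 | NA | NA |
Austria | 2002 | 4 | 51 | NA | NA |
Austria | 2003 | 3 | 33 | NA | NA |
Belgium | 1999 | 6 | 37 | NA | NA |
Belgium | 2000 | 2 | 12 | NA | NA |
Belgium | 2001 | 9 | 48 | B | 5 |
Belgium | 2002 | NA | 55 | NA | NA |
Belgium | 2003 | 7 | 41 | C | 7 |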
This is what I would like to produce:
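country | year | a1 | a2 | b1 | b2 |
---|---|---|---|---|---|
Austria | 1999 | 5 | 45 | NA | NA |
Austria | 2000 | 4 | 53 | A | 7 |
Austria | 2001 | NA | 57 | A | 7 |
Austria | 2002 | 4 | 51 | A | 7 |
Austria | 2003 | 3 | 33 | A | 7 |
Belgium | 1999 | 6 | 37 | NA | NA |
Belgium | 2000 | 2 | 12 | NA | NA |
Belgium | 2001 | 9 | 48 | B | 5 |
Belgium | 2002 | NA | 55 | B | 5 |
Belgium | 2003 | 7 | 41 | C | 7 |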
In the example, simply using tidyr::fill would produce incorrect values for Belgium in 1999 and 2000, because it would fill them with Austria's values.
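To make the problem concrete, here is a minimal sketch that applies `fill()` to the example data without any grouping. It assumes the tidyr package is installed; the object name `leaked` is only an illustrative choice.

```r
library(tidyr)

# Same toy data as in the example above
df <- data.frame(
  country = rep(c("Austria", "Belgium"), each = 5),
  year    = rep(as.character(1999:2003), times = 2),
  a1 = c(5, 4, NA, 4, 3, 6, 2, 9, NA, 7),
  a2 = c(45, 53, 57, 51, 33, 37, 12, 48, 55, 41),
  b1 = c(NA, "A", NA, NA, NA, NA, NA, "B", NA, "C"),
  b2 = c(NA, 7, NA, NA, NA, NA, NA, 5, NA, 7)
)

# Ungrouped fill: values are carried down across the country boundary
leaked <- fill(df, b1, b2, .direction = "down")

# Belgium's rows for 1999 and 2000 now carry Austria's values ("A" and 7)
leaked[leaked$country == "Belgium", c("year", "b1", "b2")]
```

The a1 and a2 columns are left alone because they are not named in the call; only the country boundary is the issue.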
As Peace Wang suggested in the comments, you just need to group_by(country). You can use tidy-select to fill only the columns that come from df B.
library(tidyverse)
country <- c("Austria","Austria","Austria","Austria","Austria",
"Belgium","Belgium","Belgium","Belgium","Belgium")
year <- c("1999","2000","2001","2002","2003",
"1999","2000","2001","2002","2003")
a1 <- c(5,4,NA,4,3,6,2,9,NA,7)
a2 <- c(45,53,57,51,33,37,12,48,55,41)
b1 <- c(NA,"A",NA,NA,NA,NA,NA,"B",NA,"C")
b2 <- c(NA,7,NA,NA,NA,NA,NA,5,NA,7)
df <- data.frame(country,year,a1,a2,b1,b2)
df %>%
  group_by(country) %>%
  arrange(year) %>%
  fill(starts_with("b"), .direction = "down") %>%
  arrange(country)
#> # A tibble: 10 x 6
#> # Groups: country [2]
#> country year a1 a2 b1 b2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 Austria 1999 5 45 <NA> NA
#> 2 Austria 2000 4 53 A 7
#> 3 Austria 2001 NA 57 A 7
#> 4 Austria 2002 4 51 A 7
#> 5 Austria 2003 3 33 A 7
#> 6 Belgium 1999 6 37 <NA> NA
#> 7 Belgium 2000 2 12 <NA> NA
#> 8 Belgium 2001 9 48 B 5
#> 9 Belgium 2002 NA 55 B 5
#> 10 Belgium 2003 7 41 C 7
Created on 2021-12-26 by the reprex package (v0.3.0)
I think the nafill-style locf (last observation carried forward) method, applied within each country group, is what you want.
library(data.table)
# requires the zoo package to be installed for zoo::na.locf
df = setDT(df)
cols = c("b1","b2")
# carry the last observed value forward within each country
df[, (cols) := lapply(.SD, zoo::na.locf, na.rm = FALSE),
   .SDcols = cols,
   by = .(country)]
# data.table::nafill can currently only process numeric columns, e.g.
# df[, b2 := nafill(b2, type = "locf"), by = .(country)]
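Because `nafill()` is limited to numeric input here, one possible workaround that stays inside data.table is to fill row positions instead of values. This is only a sketch, not part of the answer above: the helper name `locf_any()` and the object `dt` are made up for illustration, and the code assumes data.table is installed.

```r
library(data.table)

dt <- data.table(
  country = rep(c("Austria", "Belgium"), each = 5),
  year    = rep(1999:2003, times = 2),
  b1 = c(NA, "A", NA, NA, NA, NA, NA, "B", NA, "C"),
  b2 = c(NA, 7, NA, NA, NA, NA, NA, 5, NA, 7)
)

# Hypothetical helper: carry forward the *position* of the last non-NA
# element with nafill(), then index the original vector with it.
# Positions are integers, so this works for any column type.
locf_any <- function(x) {
  pos <- replace(seq_along(x), is.na(x), NA_integer_)
  x[nafill(pos, type = "locf")]
}

cols <- c("b1", "b2")
dt[, (cols) := lapply(.SD, locf_any), .SDcols = cols, by = country]
dt
```

Leading NAs within a country stay NA, because there is no earlier position to carry forward.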
You could also open the black box and do it in base R:
toIm <- c("b1", "b2")
do.call(rbind, c(by(dat, dat$country, \(z) {
  z[toIm] <- lapply(z[toIm], \(y) {
    unlist(by(y, cumsum(!is.na(y)), \(x)
      by(x, cumsum(!is.na(x)), \(w) rep(w[1], length(w)))))
  })
  z
}), make.row.names = FALSE))
# country year a1 a2 b1 b2
# 1 Austria 1999 5 45 <NA> NA
# 2 Austria 2000 4 53 A 7
# 3 Austria 2001 NA 57 A 7
# 4 Austria 2002 4 51 A 7
# 5 Austria 2003 3 33 A 7
# 6 Belgium 1999 6 37 <NA> NA
# 7 Belgium 2000 2 12 <NA> NA
# 8 Belgium 2001 9 48 B 5
# 9 Belgium 2002 NA 55 B 5
# 10 Belgium 2003 7 41 C 7
Note: R version 4.1.2 (2021-11-01)
Data:
dat <- structure(list(country = c("Austria", "Austria", "Austria", "Austria",
"Austria", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium"
), year = c(1999L, 2000L, 2001L, 2002L, 2003L, 1999L, 2000L,
2001L, 2002L, 2003L), a1 = c(5L, 4L, NA, 4L, 3L, 6L, 2L, 9L,
NA, 7L), a2 = c(45L, 53L, 57L, 51L, 33L, 37L, 12L, 48L, 55L,
41L), b1 = c(NA, "A", NA, NA, NA, NA, NA, "B", NA, "C"), b2 = c(NA,
7L, NA, NA, NA, NA, NA, 5L, NA, 7L)), class = "data.frame", row.names = c(NA,
-10L))
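The base R approach above rests on one idea: `cumsum(!is.na(y))` goes up by one at every observed value, so it labels each run made of an observed value and the NAs that follow it. The sketch below shows that idea without the nested `by()` calls. It is an illustrative variant rather than the answer's original code, and the helper name `locf_base()` is hypothetical.

```r
dat <- data.frame(
  country = rep(c("Austria", "Belgium"), each = 5),
  year    = rep(1999:2003, times = 2),
  b1 = c(NA, "A", NA, NA, NA, NA, NA, "B", NA, "C"),
  b2 = c(NA, 7, NA, NA, NA, NA, NA, 5, NA, 7)
)

# Run ids for Belgium's b1: 0 0 1 1 2 (0 means nothing observed yet)
run_id <- cumsum(!is.na(dat$b1[dat$country == "Belgium"]))

# Hypothetical helper: prepend a typed NA so run id 0 maps to NA, then
# look up the observed value that belongs to each run.
locf_base <- function(x) {
  c(x[NA_integer_], x[!is.na(x)])[cumsum(!is.na(x)) + 1]
}

toIm <- c("b1", "b2")
for (col in toIm) {
  # split by country, fill within each piece, then restore the row order
  dat[[col]] <- unsplit(lapply(split(dat[[col]], dat$country), locf_base),
                        dat$country)
}
dat
```

Splitting by country before filling is what keeps Austria's last value from leaking into Belgium's first rows.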