Is there an R function for imputing missing year values, consecutively, by group?
My data frame looks like this:
df <- data.frame(ID=c("A", "A", "A", "A",
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))
I would like to impute the missing year values within each ID and get the following expected result:
wanted_df <- data.frame(ID=c("A", "A", "A", "A",
                             "B", "B", "B", "B",
                             "C", "C", "C", "C",
                             "D", "D", "D", "D"),
                        grade=c("KG", "01", "02", "03",
                                "KG", "01", "02", "03",
                                "KG", "01", "02", "03",
                                "KG", "01", "02", "03"),
                        year=c(2002, 2003, 2004, 2005,
                               2007, 2008, 2009, 2010,
                               2004, 2005, 2006, 2007,
                               2009, 2010, 2011, 2012))
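I tried imputing the values by:
- using the lag() and lead() functions
- joining a data frame made up of the years
Neither approach worked. Any help would be greatly appreciated. Thanks.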
We can use na_interpolate() and na_extrapolate() from the threadr package:
library(dplyr)
# remotes::install_github("skgrange/threadr")
library(threadr)
df %>%
  group_by(ID) %>%
  mutate(year = na_extrapolate(na_interpolate(year))) %>%
  ungroup()
Output:
# A tibble: 16 × 3
ID grade year
<chr> <chr> <dbl>
1 A KG 2002
2 A 01 2003
3 A 02 2004
4 A 03 2005
5 B KG 2007
6 B 01 2008
7 B 02 2009
8 B 03 2010
9 C KG 2004.
10 C 01 2005
11 C 02 2006
12 C 03 2007
13 D KG 2009
14 D 01 2010
15 D 02 2011
16 D 03 2012.
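If installing threadr from GitHub is not an option, roughly the same idea can be sketched in base R. The helper name fill_linear below is hypothetical and is not threadr's implementation: it fits a least-squares line through the observed (row position, year) pairs of a group and predicts the missing positions, assuming each group has at least two observed years and one row per consecutive year.

```r
# Rebuild the example data compactly (same values as in the question).
df <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), each = 4),
  grade = rep(c("KG", "01", "02", "03"), times = 4),
  year  = c(2002, 2003, NA, 2005,
            2007, NA, NA, 2010,
            NA, 2005, 2006, NA,
            2009, 2010, NA, NA)
)

# Hypothetical helper (not part of threadr): fit a straight line through the
# observed points and predict the years at the missing positions. This covers
# interior gaps as well as leading and trailing NAs.
fill_linear <- function(years) {
  pos <- seq_along(years)
  obs <- !is.na(years)
  # Nothing to do if nothing is missing; a line needs at least two points.
  if (all(obs) || sum(obs) < 2) return(years)
  fit <- lm(y ~ x, data = data.frame(x = pos[obs], y = years[obs]))
  # round() removes tiny floating-point error from the prediction.
  years[!obs] <- round(predict(fit, newdata = data.frame(x = pos[!obs])))
  years
}

# ave() applies the helper within each ID and keeps the original row order.
df$year_filled <- ave(df$year, df$ID, FUN = fill_linear)
df
```

Because this is a regression rather than a true interpolation, it only reproduces the wanted result when the observed years in a group already lie on a straight line, as they do here.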
You could do it like this:
year_imputer <- function(years){
  # Find one non-missing data-point
  ref_indx <- which(!is.na(years))[1]
  # Make it the reference point
  ref <- years[ref_indx]
  # Get the length of the years
  years_len <- length(years)
  # Generate the sequence
  (ref - (ref_indx - 1)):(ref+(years_len - ref_indx))
}
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(
    year = year_imputer(year)
  ) %>%
  ungroup()
Output:
# A tibble: 16 x 3
ID grade year
<chr> <chr> <int>
1 A KG 2002
2 A 01 2003
3 A 02 2004
4 A 03 2005
5 B KG 2007
6 B 01 2008
7 B 02 2009
8 B 03 2010
9 C KG 2004
10 C 01 2005
11 C 02 2006
12 C 03 2007
13 D KG 2009
14 D 01 2010
15 D 02 2011
16 D 03 2012
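Note that year_imputer() rebuilds the whole column from a single reference point, so it relies on the years within each ID being strictly consecutive with one row per year; an observed value that breaks that pattern would be silently overwritten. A minimal sketch of a stricter variant follows. The name year_imputer_checked is hypothetical, and stopping with an error is just one possible design choice.

```r
# Hypothetical stricter variant: every observed year must imply the same
# starting year for the group, otherwise the function stops with an error.
year_imputer_checked <- function(years) {
  obs <- which(!is.na(years))
  if (length(obs) == 0) stop("no observed year in this group")
  # Year implied for the first row by each observed value
  start <- years[obs] - (obs - 1)
  if (length(unique(start)) != 1) stop("observed years are not consecutive")
  start[1] + seq_along(years) - 1
}

# Group C from the question: both observed values imply a start of 2004.
year_imputer_checked(c(NA, 2005, 2006, NA))

# An inconsistent group (2002 followed two rows later by 2005) raises an error.
res <- try(year_imputer_checked(c(2002, NA, 2005)), silent = TRUE)
inherits(res, "try-error")
```

It can be dropped into the same group_by(ID) %>% mutate(...) call in place of year_imputer().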
This is not a solution I would recommend, but this question led to another question (see here):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(year= ifelse(is.na(year), lag(year)+1, year),
         year= ifelse(is.na(year), lag(year)+1, year),
         year= ifelse(is.na(year), lead(year)-1, year))
ID grade year
<chr> <chr> <dbl>
1 A KG 2002
2 A 01 2003
3 A 02 2004
4 A 03 2005
5 B KG 2007
6 B 01 2008
7 B 02 2009
8 B 03 2010
9 C KG 2004
10 C 01 2005
11 C 02 2006
12 C 03 2007
13 D KG 2009
14 D 01 2010
15 D 02 2011
16 D 03 2012
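Each ifelse() pass above fills an NA only when the neighbouring value is already known, so a fixed number of passes is enough for this data but can leave part of a longer run of NAs unfilled. As a rough base R sketch of the same idea, the passes can be repeated until nothing changes. The function name fill_by_neighbours is hypothetical, and the shifted vectors stand in for dplyr's lag() and lead().

```r
# Hypothetical helper: repeat the lag/lead passes until the vector stops changing.
fill_by_neighbours <- function(years) {
  repeat {
    before <- years
    prev <- c(NA, head(years, -1))  # previous value, like dplyr::lag()
    nxt  <- c(tail(years, -1), NA)  # next value, like dplyr::lead()
    years <- ifelse(is.na(years), prev + 1, years)  # fill forwards
    years <- ifelse(is.na(years), nxt - 1, years)   # fill backwards
    if (identical(before, years)) break
  }
  years
}

# Three leading NAs and one trailing NA around a single observed year.
fill_by_neighbours(c(NA, NA, NA, 2010, NA))
```

Applied per group (for example with ave(df$year, df$ID, FUN = fill_by_neighbours)), this still assumes one row per consecutive year.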
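You can use any (linear) interpolation function for this particular problem, for example the one from the imputeTS package, or those from zoo and other packages. However, this approach may no longer work when you are dealing with POSIXct data rather than plain numerics.
It is also worth noting that this only works if each missing year is actually present as an NA value (rather than simply being left out). For that case, known as implicit missing values (left-out years), the tsibble package has a function called fill_gaps().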
library("imputeTS")
library("dplyr")
df <- data.frame(ID=c("A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
grade=c("KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03"),
year=c(2002, 2003, NA, 2005,
2007, NA, NA, 2010,
NA, 2005, 2006, NA,
2009, 2010, NA, NA))
df %>% group_by(ID) %>% mutate(year = na_interpolation(year)) %>% ungroup
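One thing worth checking with any interpolation function is how it treats NAs at the start or end of a group (groups C and D here), since interpolation proper only fills gaps between two observed values. Base R's approx() shows the two common behaviours. My understanding is that the linear option in imputeTS behaves like rule = 2, but that is an assumption to verify against the wanted result.

```r
# Group C from the question: one leading and one trailing NA.
years <- c(NA, 2005, 2006, NA)
pos <- seq_along(years)
obs <- !is.na(years)

# rule = 1: positions outside the observed range stay NA.
r1 <- approx(pos[obs], years[obs], xout = pos, rule = 1)$y

# rule = 2: positions outside the observed range repeat the nearest observed
# value, so the trend is NOT continued (2005 2005 2006 2006, not 2004 ... 2007).
r2 <- approx(pos[obs], years[obs], xout = pos, rule = 2)$y

r1
r2
```

If the edges come back as repeated values, a separate extrapolation step is needed to get 2004 and 2007 for group C.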
Another solution:
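library(dplyr)
df %>%
  group_by(ID) %>%
  # Count back from the earliest observed year to the first row, then forward
  # to the last row; seq_len() keeps the forward part empty when the earliest
  # observed year sits in the last row (1:0 would count down instead).
  mutate(year = c(min(year, na.rm = TRUE) - (which.min(year) - 1):0,
                  min(year, na.rm = TRUE) + seq_len(n() - which.min(year)))) %>%
  ungroup()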
#> # A tibble: 16 × 3
#> ID grade year
#> <chr> <chr> <dbl>
#> 1 A KG 2002
#> 2 A 01 2003
#> 3 A 02 2004
#> 4 A 03 2005
#> 5 B KG 2007
#> 6 B 01 2008
#> 7 B 02 2009
#> 8 B 03 2010
#> 9 C KG 2004
#> 10 C 01 2005
#> 11 C 02 2006
#> 12 C 03 2007
#> 13 D KG 2009
#> 14 D 01 2010
#> 15 D 02 2011
#> 16 D 03 2012
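Solutions that build the sequence outward from the earliest observed year need some care when that year sits in the last row of a group: in R, 1:0 counts down to c(1, 0) rather than being empty, whereas seq_len(0) returns an empty vector. A small base R sketch of the difference follows; the variable names are only illustrative.

```r
# A group whose only observed year is in the last row.
years <- c(NA, NA, NA, 2005)
n <- length(years)
k <- which.min(years)          # position of the earliest observed year (NAs ignored)
m <- min(years, na.rm = TRUE)  # the earliest observed year

# Forward part built with the colon operator: 1:0 is c(1, 0), so two
# unwanted values are appended and the result is too long for the group.
with_colon <- c(m - (k - 1):0, m + 1:(n - k))

# Forward part built with seq_len(): empty when k == n, so the length matches.
with_seq_len <- c(m - (k - 1):0, m + seq_len(n - k))

length(with_colon)
with_seq_len
```

Inside mutate(), the too-long vector would cause a size error for that group, so seq_len() is the safer way to write the forward part.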