使用线性近似值对 NA 观测值进行插补
Imputation for bounding NA observations using a linear approximation
我想在数组的开头估算 NA 观察值,使用以下两个非 NA 观察值的线性近似值来推断缺失值。然后使用前面的两个非 NA 观察值对数组末尾的 NA 观察值执行相同的操作。
我的 df 的可重现示例:
M=matrix(sample(1:9,10*10,T),10);M[sample(1:length(M),0.5*length(M),F)]=NA;dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
M
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 NA 4 5 NA 3 NA NA NA 5 NA
City2 6 NA 3 3 NA 4 6 NA NA 7
City3 NA 7 NA 8 8 NA NA 8 NA 5
City4 3 5 3 NA NA 3 5 9 8 7
City5 4 6 6 NA NA 8 NA 7 1 NA
City6 NA NA NA NA 4 NA 8 3 6 7
City7 9 3 NA NA NA NA NA 4 NA NA
City8 5 6 9 8 5 NA NA 1 4 NA
City9 NA NA 6 NA 3 3 8 NA 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
idx=rowSums(!is.na(M))>=2 # Index of rows with 2 or more non-NA to run na.approx
library(zoo)
M[idx,]=t(na.approx(t(M[idx,]),rule=1,method="linear")) # I'm using t as na.approx works on columns
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 NA 4.0 5 4.0 3.000000 3.50 4.0 4.5 5 NA
City2 6.0 5.5 3 3.0 5.500000 4.00 6.0 6.0 6 7
City3 4.5 7.0 3 8.0 8.000000 3.50 5.5 8.0 7 5
City4 3.0 5.0 3 8.0 6.666667 3.00 5.0 9.0 8 7
City5 4.0 6.0 6 8.0 5.333333 8.00 6.5 7.0 1 7
City6 6.5 4.5 7 8.0 4.000000 6.75 8.0 3.0 6 7
City7 9.0 3.0 8 8.0 4.500000 5.50 8.0 4.0 5 NA
City8 5.0 6.0 9 8.0 5.000000 4.25 8.0 1.0 4 NA
City9 NA NA 6 4.5 3.000000 3.00 8.0 7.5 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
我想使用基于两个 preceding/following 观察结果的线性近似来推断边界(对于 City1
和 City9
)。例如 M[1,1]
应该是 3
而 M[1,10]
应该是 5,5
.
你知道我该怎么做吗?
这为您提供了第一列,其中填充了 NA
的线性外推值。您可以适应最后一列。
firstNAfill <- function(x) {
ans <- ifelse(!is.na(x[1]),
x[1],
ifelse(sum(!is.na(x))<2, NA,
2*x[which(!is.na(x[1, ]))[1]] - x[which(!is.na(x[1, ]))[2]]
)
)
return(ans)
}
dat$Year1 <- unlist(lapply(seq(1:nrow(dat)), function(x) {firstNAfill(dat[x, ])}))
结果:
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 3.0 4.0 5 4.0 3.000000 3.50 4.0 4.5 5 NA
City2 6.0 5.5 3 3.0 5.500000 4.00 6.0 6.0 6 7
City3 4.5 7.0 3 8.0 8.000000 3.50 5.5 8.0 7 5
City4 3.0 5.0 3 8.0 6.666667 3.00 5.0 9.0 8 7
City5 4.0 6.0 6 8.0 5.333333 8.00 6.5 7.0 1 7
City6 6.5 4.5 7 8.0 4.000000 6.75 8.0 3.0 6 7
City7 9.0 3.0 8 8.0 4.500000 5.50 8.0 4.0 5 NA
City8 5.0 6.0 9 8.0 5.000000 4.25 8.0 1.0 4 NA
City9 7.5 NA 6 4.5 3.000000 3.00 8.0 7.5 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
函数 returns 第一列的当前值,如果不是 NA
,NA
如果没有两个值可以外推,否则为外推值。
在extrap
中,nlead
是输入向量x
中前导NA的数量。 non.na
是 x
的非 NA 元素的子集。 Return 如果没有前导 NA 元素或少于 2 个非 NA 元素,则为输入。 m
是前两个非 NA 的斜率。用外推替换 x
的前 nlead
个元素。最后,我们使用 MM[] <-
将 extrap
应用于 M
的每一行,以便保留列名,然后反转每一行,重复并反转:
library(zoo)
extrap <- function(x) {
nlead <- which.min(x * 0) - 1
non.na <- na.omit(x)
if (length(nlead) == 0 || nlead == 0) || length(non.na) < 2) return(x)
m <- diff(head(non.na, 2))
replace(x, seq_len(nlead), non.na[1] - nlead:1 * m)
}
nc <- ncol(M)
naApprox <- function(x) if (length(na.omit(x)) < 2) x else na.approx(x, na.rm = FALSE)
MM <- M
MM[] <- t(apply(MM, 1, naApprox))
MM[] <- t(apply(MM, 1, extrap)) # extraploate to fill leading NAs
MM[] <- t(apply(MM[, nc:1], 1, extrap))[, nc:1] # extrapolate to fill trailing NAs
给予:
> MM
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 3.0 4.0 5.000000 4.000000 3.000000 3.500000 4.000000 4.500000 5.000000 5.500000
City2 6.0 4.5 3.000000 3.000000 3.500000 4.000000 6.000000 6.333333 6.666667 7.000000
City3 6.5 7.0 7.500000 8.000000 8.000000 8.000000 8.000000 8.000000 6.500000 5.000000
City4 3.0 5.0 3.000000 3.000000 3.000000 3.000000 5.000000 9.000000 8.000000 7.000000
City5 4.0 6.0 6.000000 6.666667 7.333333 8.000000 7.500000 7.000000 1.000000 -5.000000
City6 -4.0 -2.0 0.000000 2.000000 4.000000 6.000000 8.000000 3.000000 6.000000 7.000000
City7 9.0 3.0 3.166667 3.333333 3.500000 3.666667 3.833333 4.000000 4.166667 4.333333
City8 5.0 6.0 9.000000 8.000000 5.000000 3.666667 2.333333 1.000000 4.000000 7.000000
City9 9.0 7.5 6.000000 4.500000 3.000000 3.000000 8.000000 7.500000 7.000000 6.500000
City10 NA NA NA NA NA NA NA NA NA 1.000000
注意我们用这个作为M
:
M <- structure(c(NA, 6L, NA, 3L, 4L, NA, 9L, 5L, NA, NA, 4L, NA, 7L,
5L, 6L, NA, 3L, 6L, NA, NA, 5L, 3L, NA, 3L, 6L, NA, NA, 9L, 6L,
NA, NA, 3L, 8L, NA, NA, NA, NA, 8L, NA, NA, 3L, NA, 8L, NA, NA,
4L, NA, 5L, 3L, NA, NA, 4L, NA, 3L, 8L, NA, NA, NA, 3L, NA, NA,
6L, NA, 5L, NA, 8L, NA, NA, 8L, NA, NA, NA, 8L, 9L, 7L, 3L, 4L,
1L, NA, NA, 5L, NA, NA, 8L, 1L, 6L, NA, 4L, 7L, NA, NA, 7L, 5L,
7L, NA, 7L, NA, NA, NA, 1L), .Dim = c(10L, 10L), .Dimnames = list(
c("City1", "City2", "City3", "City4", "City5", "City6", "City7",
"City8", "City9", "City10"), c("Year1", "Year2", "Year3",
"Year4", "Year5", "Year6", "Year7", "Year8", "Year9", "Year10"
)))
更新:已修复。
我想在数组的开头估算 NA 观察值,使用以下两个非 NA 观察值的线性近似值来推断缺失值。然后使用前面的两个非 NA 观察值对数组末尾的 NA 观察值执行相同的操作。
我的 df 的可重现示例:
M=matrix(sample(1:9,10*10,T),10);M[sample(1:length(M),0.5*length(M),F)]=NA;dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
M
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 NA 4 5 NA 3 NA NA NA 5 NA
City2 6 NA 3 3 NA 4 6 NA NA 7
City3 NA 7 NA 8 8 NA NA 8 NA 5
City4 3 5 3 NA NA 3 5 9 8 7
City5 4 6 6 NA NA 8 NA 7 1 NA
City6 NA NA NA NA 4 NA 8 3 6 7
City7 9 3 NA NA NA NA NA 4 NA NA
City8 5 6 9 8 5 NA NA 1 4 NA
City9 NA NA 6 NA 3 3 8 NA 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
idx=rowSums(!is.na(M))>=2 # Index of rows with 2 or more non-NA to run na.approx
library(zoo)
M[idx,]=t(na.approx(t(M[idx,]),rule=1,method="linear")) # I'm using t as na.approx works on columns
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 NA 4.0 5 4.0 3.000000 3.50 4.0 4.5 5 NA
City2 6.0 5.5 3 3.0 5.500000 4.00 6.0 6.0 6 7
City3 4.5 7.0 3 8.0 8.000000 3.50 5.5 8.0 7 5
City4 3.0 5.0 3 8.0 6.666667 3.00 5.0 9.0 8 7
City5 4.0 6.0 6 8.0 5.333333 8.00 6.5 7.0 1 7
City6 6.5 4.5 7 8.0 4.000000 6.75 8.0 3.0 6 7
City7 9.0 3.0 8 8.0 4.500000 5.50 8.0 4.0 5 NA
City8 5.0 6.0 9 8.0 5.000000 4.25 8.0 1.0 4 NA
City9 NA NA 6 4.5 3.000000 3.00 8.0 7.5 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
我想使用基于两个 preceding/following 观察结果的线性近似来推断边界(对于 City1
和 City9
)。例如 M[1,1]
应该是 3
而 M[1,10]
应该是 5,5
.
你知道我该怎么做吗?
这为您提供了第一列,其中填充了 NA
的线性外推值。您可以适应最后一列。
firstNAfill <- function(x) {
ans <- ifelse(!is.na(x[1]),
x[1],
ifelse(sum(!is.na(x))<2, NA,
2*x[which(!is.na(x[1, ]))[1]] - x[which(!is.na(x[1, ]))[2]]
)
)
return(ans)
}
dat$Year1 <- unlist(lapply(seq(1:nrow(dat)), function(x) {firstNAfill(dat[x, ])}))
结果:
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 3.0 4.0 5 4.0 3.000000 3.50 4.0 4.5 5 NA
City2 6.0 5.5 3 3.0 5.500000 4.00 6.0 6.0 6 7
City3 4.5 7.0 3 8.0 8.000000 3.50 5.5 8.0 7 5
City4 3.0 5.0 3 8.0 6.666667 3.00 5.0 9.0 8 7
City5 4.0 6.0 6 8.0 5.333333 8.00 6.5 7.0 1 7
City6 6.5 4.5 7 8.0 4.000000 6.75 8.0 3.0 6 7
City7 9.0 3.0 8 8.0 4.500000 5.50 8.0 4.0 5 NA
City8 5.0 6.0 9 8.0 5.000000 4.25 8.0 1.0 4 NA
City9 7.5 NA 6 4.5 3.000000 3.00 8.0 7.5 7 NA
City10 NA NA NA NA NA NA NA NA NA 1
函数 returns 第一列的当前值,如果不是 NA
,NA
如果没有两个值可以外推,否则为外推值。
在extrap
中,nlead
是输入向量x
中前导NA的数量。 non.na
是 x
的非 NA 元素的子集。 Return 如果没有前导 NA 元素或少于 2 个非 NA 元素,则为输入。 m
是前两个非 NA 的斜率。用外推替换 x
的前 nlead
个元素。最后,我们使用 MM[] <-
将 extrap
应用于 M
的每一行,以便保留列名,然后反转每一行,重复并反转:
library(zoo)
extrap <- function(x) {
nlead <- which.min(x * 0) - 1
non.na <- na.omit(x)
if (length(nlead) == 0 || nlead == 0) || length(non.na) < 2) return(x)
m <- diff(head(non.na, 2))
replace(x, seq_len(nlead), non.na[1] - nlead:1 * m)
}
nc <- ncol(M)
naApprox <- function(x) if (length(na.omit(x)) < 2) x else na.approx(x, na.rm = FALSE)
MM <- M
MM[] <- t(apply(MM, 1, naApprox))
MM[] <- t(apply(MM, 1, extrap)) # extraploate to fill leading NAs
MM[] <- t(apply(MM[, nc:1], 1, extrap))[, nc:1] # extrapolate to fill trailing NAs
给予:
> MM
Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1 3.0 4.0 5.000000 4.000000 3.000000 3.500000 4.000000 4.500000 5.000000 5.500000
City2 6.0 4.5 3.000000 3.000000 3.500000 4.000000 6.000000 6.333333 6.666667 7.000000
City3 6.5 7.0 7.500000 8.000000 8.000000 8.000000 8.000000 8.000000 6.500000 5.000000
City4 3.0 5.0 3.000000 3.000000 3.000000 3.000000 5.000000 9.000000 8.000000 7.000000
City5 4.0 6.0 6.000000 6.666667 7.333333 8.000000 7.500000 7.000000 1.000000 -5.000000
City6 -4.0 -2.0 0.000000 2.000000 4.000000 6.000000 8.000000 3.000000 6.000000 7.000000
City7 9.0 3.0 3.166667 3.333333 3.500000 3.666667 3.833333 4.000000 4.166667 4.333333
City8 5.0 6.0 9.000000 8.000000 5.000000 3.666667 2.333333 1.000000 4.000000 7.000000
City9 9.0 7.5 6.000000 4.500000 3.000000 3.000000 8.000000 7.500000 7.000000 6.500000
City10 NA NA NA NA NA NA NA NA NA 1.000000
注意我们用这个作为M
:
M <- structure(c(NA, 6L, NA, 3L, 4L, NA, 9L, 5L, NA, NA, 4L, NA, 7L,
5L, 6L, NA, 3L, 6L, NA, NA, 5L, 3L, NA, 3L, 6L, NA, NA, 9L, 6L,
NA, NA, 3L, 8L, NA, NA, NA, NA, 8L, NA, NA, 3L, NA, 8L, NA, NA,
4L, NA, 5L, 3L, NA, NA, 4L, NA, 3L, 8L, NA, NA, NA, 3L, NA, NA,
6L, NA, 5L, NA, 8L, NA, NA, 8L, NA, NA, NA, 8L, 9L, 7L, 3L, 4L,
1L, NA, NA, 5L, NA, NA, 8L, 1L, 6L, NA, 4L, 7L, NA, NA, 7L, 5L,
7L, NA, 7L, NA, NA, NA, 1L), .Dim = c(10L, 10L), .Dimnames = list(
c("City1", "City2", "City3", "City4", "City5", "City6", "City7",
"City8", "City9", "City10"), c("Year1", "Year2", "Year3",
"Year4", "Year5", "Year6", "Year7", "Year8", "Year9", "Year10"
)))
更新:已修复。