R: Extrapolating x number of values beyond known values

I am looking for a function/method to (linearly) extrapolate x values beyond the original values.

Suppose I start with:

a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)

and I want to extrapolate by two values at each end, so that I end up with:

[1] NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA

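So far I have found the approxExtrap function from Hmisc (https://rdrr.io/cran/Hmisc/man/approxExtrap.html). But since you have to define 'xout', I feel I would have to write a loop that selects, one at a time, each block I want to extrapolate. That is certainly possible, but eventually I expect to have series with millions of data points and many gaps, so I worry this will be too time-consuming. So I am hoping I have overlooked a simpler solution.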
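For what it's worth, the loop only has to visit each block once, not each data point, and finding the blocks can be done in one vectorised pass. A minimal base R sketch, where `find_runs` is a made-up helper name and not part of Hmisc:

```r
# Hypothetical helper: start and end index of every non-NA block in x.
find_runs <- function(x) {
  r <- rle(!is.na(x))                  # run lengths of NA / non-NA stretches
  ends <- cumsum(r$lengths)            # last index of every stretch
  starts <- ends - r$lengths + 1L      # first index of every stretch
  data.frame(start = starts[r$values], # keep only the non-NA stretches
             end = ends[r$values])
}

a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
runs <- find_runs(a)
print(runs)
```

With roughly 140 points per NA-plus-data cycle, a series of a few million points has only tens of thousands of blocks to loop over.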

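Added: There are no small gaps in the data; typically there are ~100 NAs followed by ~40 data points. I want to extrapolate/extend each run of 40 data points by 5 new data points before its start and 5 after its end, replacing the 5 NAs at each of those two positions. Interpolating between two runs of 40 data points is not an option.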
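A base R sketch of that requirement, without Hmisc, could look like the following. The function name `extend_runs` is made up for illustration, and it makes two assumptions of its own: each end is extended with the slope of the two outermost points of the block, and only NA positions inside the vector are ever filled.

```r
# Hypothetical helper: extend every non-NA block of x by n points on each side.
extend_runs <- function(x, n) {
  r <- rle(!is.na(x))
  ends_all <- cumsum(r$lengths)
  starts <- (ends_all - r$lengths + 1L)[r$values]
  ends <- ends_all[r$values]
  out <- x
  k <- seq_len(n)
  for (i in seq_along(starts)) {
    s <- starts[i]
    e <- ends[i]
    if (e == s) next                      # a single point defines no slope
    # positions before the block: stay inside the vector, fill NAs only
    before <- s - k
    before <- before[before >= 1]
    before <- before[is.na(x[before])]
    out[before] <- x[s] - (x[s + 1] - x[s]) * (s - before)
    # positions after the block, same restrictions
    after <- e + k
    after <- after[after <= length(x)]
    after <- after[is.na(x[after])]
    out[after] <- x[e] + (x[e] - x[e - 1]) * (after - e)
  }
  out
}

a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
extended <- extend_runs(a, 2)
print(extended)
```

If a gap is shorter than 2 * n, the two neighbouring blocks would both write into it and the later block wins; with gaps of ~100 NAs and n = 5 that case does not arise.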

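I solved the problem by:

  1. Determining the ranges of the different series of data
  2. Defining the range over which I want to extrapolate
  3. Performing the actual extrapolation with the Hmisc package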
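The three steps above can also be sketched more compactly. This is not the script below, just an illustration under a few assumptions: `extrapolate_with_hmisc` is a made-up name, the data is equally spaced so the row index serves as the x value, and the gaps between blocks are longer than 2 * n so neighbouring extensions never overlap.

```r
# Hypothetical wrapper around Hmisc::approxExtrap(); requires Hmisc to be installed.
extrapolate_with_hmisc <- function(x, n) {
  if (!requireNamespace("Hmisc", quietly = TRUE)) {
    stop("Package 'Hmisc' is required for this sketch")
  }
  # Step 1: ranges of the non-NA blocks
  r <- rle(!is.na(x))
  ends_all <- cumsum(r$lengths)
  starts <- (ends_all - r$lengths + 1L)[r$values]
  ends <- ends_all[r$values]
  # Step 2: ranges to extrapolate over, clamped to the length of the data
  lo <- pmax(starts - n, 1L)
  hi <- pmin(ends + n, length(x))
  # Step 3: extrapolate block by block
  out <- x
  for (i in seq_along(starts)) {
    if (ends[i] == starts[i]) next        # need at least two points
    idx <- starts[i]:ends[i]
    xout <- lo[i]:hi[i]
    out[xout] <- Hmisc::approxExtrap(idx, x[idx], xout = xout)$y
  }
  out
}

# Usage, guarded so the snippet also runs where Hmisc is missing
if (requireNamespace("Hmisc", quietly = TRUE)) {
  x <- c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20))
  print(extrapolate_with_hmisc(x, 5)[1:50])
}
```

The loop runs once per block rather than once per row, which is what should keep it manageable for long series.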

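Initially, I thought I could only manage this with some loop that had to go through the original data row by row, and I was hoping for an existing function.

I am sure many of you would write more efficient and better code for this. But I still wanted to post my script for anyone with a similar problem.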

library(Hmisc) # provides approxExtrap()
extrapol.length <- 5
# 'Time' is not used below because my data was equally spaced in time; to use it,
# change the first argument of approxExtrap() in the second-to-last line.
test <- data.frame('Time' = c(1:100),
                   'x' = c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20)))

## Determine start and end of the continuous (non-NA) data streams
length.values <- diff(c(0, which(is.na(test[,2]))))-2 # length of each non-NA run, minus 1
length.values <- length.values[length.values > -1]
length.nas <- diff(c(0, which(!is.na(test[,2])))) # length of each NA run, plus 1
length.nas <- length.nas[length.nas > 1]
if(is.na(test[1,2])){
  # data starts with NA
  length.nas <- data.frame('Order' = seq(1, length(length.nas)*2, by = 2),
                           'Length' = length.nas, 'Type' = 'na')
  length.values <- data.frame('Order' = seq(2, length(length.values)*2, by = 2),
                              'Length' = length.values, 'Type' = 'value')
  start.end <- rbind(length.nas, length.values)
  
  start.end <- start.end[order(start.end$Order),]
  
  value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'na'])),
                           'start' = NA, 'end' = NA)
  for(a in value.seqs$no){
    value.seqs$start[a] <- sum(start.end$Length[1:((a*2)-1)])
    value.seqs$end[a] <- sum(start.end$Length[1:(a*2)])
  }
}else{
  # Data starts with actual values
  length.nas <- data.frame('Order' = seq(2, length(length.nas)*2, by = 2),
                           'Length' = length.nas, 'Type' = 'na')
  length.values <- data.frame('Order' = seq(1, length(length.values)*2, by = 2),
                              'Length' = length.values, 'Type' = 'value')
  start.end <- rbind(length.nas, length.values)
  
  start.end <- start.end[order(start.end$Order),]
  
  value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'value'])),
                           'start' = c(1,rep(NA, (length(start.end$Type[start.end$Type == 'value'])-1))), 'end' = NA)
  for(a in value.seqs$no){
    value.seqs$end[a] <- sum(start.end$Length[1:((a*2)-1)])+1
    if(a < max(value.seqs$no))
      value.seqs$start[a+1] <- sum(start.end$Length[1:(a*2)])+1
  }
}

## Do not extrapolate outside of the time-range of the original dataframe
value.seqs$start.extr <- value.seqs$start - extrapol.length
value.seqs$start.extr[value.seqs$start.extr < 1] <- 1 # do not extrapolate below time < 1
value.seqs$end.extr <- value.seqs$end + extrapol.length
value.seqs$end.extr[value.seqs$end.extr > nrow(test) | is.na(value.seqs$end.extr)] <- nrow(test)
value.seqs$end[is.na(value.seqs$end)] <- max(which(!is.na(test[,2])))


## Extrapolate 
for(b in value.seqs$no){
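  rows.in <- value.seqs$start[b]:value.seqs$end[b]             # rows holding the original values
  rows.out <- value.seqs$start.extr[b]:value.seqs$end.extr[b]  # rows to fill, including the extension
  # the row index serves as x; the result is written to a new third column
  test[rows.out, 3] <- approxExtrap(rows.in, test[rows.in, 2], xout = rows.out)$y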
}

Thanks for thinking along!