将缺失的行插入并插入数据框 R

Question

我正在处理大量需要大量操作的海洋数据时间序列。我丢失了几天的数据，想对它们进行插值。具体来说 date/depth/temperature。这是我的 df 的示例：

> tibble(df)
# A tibble: 351,685 x 9
   date       time     depthR    SV  temp salinity conduct density calcSV
   <date>     <times>  <dbl> <dbl> <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
 1 2021-11-17 07:50:18   0.5 1524.  19.7     37.8    51.0    27    1524.
 2 2021-11-17 07:50:22   0.5 1524.  19.9     37.6    50.9    26.8  1524.
 3 2021-11-17 07:50:23   1.1 1524.  19.9     37.6    50.9    26.8  1524.
 4 2021-11-17 07:50:24   1.5 1524.  19.9     37.6    50.9    26.8  1524.
 5 2021-11-17 07:50:25   2   1524.  19.9     37.6    50.9    26.8  1524.

每个日期都包含 1000 多行数据，因此我的想法是找到每一天的最大深度，从而为两者之间缺失的日子插入合理的最大深度值。到目前为止，我已经找到了每个日期的最大深度：

group <- df %>% group_by(date) %>% summarise(max =max(depthR, na.rm=TRUE))
> tibble(group)
# A tibble: 40 x 2
   date         max
   <date>     <dbl>
 1 2021-11-17 685. 
 2 2021-11-18 695. 
 3 2021-11-19 136. 
 4 2021-11-20 138. 
 5 2021-11-21 142. 
 6 2021-11-22  26  
 7 2021-11-23 136. 
 8 2021-11-24 297. 
 9 2021-11-25 613. 
10 2021-11-26  81.1
# ... with 30 more rows

然后我设法通过以下方式插入缺失的日期：

> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
   date      
   <date>    
 1 2021-11-17
 2 2021-11-18
 3 2021-11-19
 4 2021-11-20
 5 2021-11-21
 6 2021-11-22
 7 2021-11-23
 8 2021-11-24
 9 2021-11-25
10 2021-11-26
# ... with 59 more rows

如您所见，先前的查询已被覆盖。所以我尝试为插值日期创建一个新的 df 并尝试将它们合并在一起。我收到错误：

> library(stringr)
> group$combined <-  str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.

如何在不覆盖原始数据或不冲突的情况下，将这两个不同长度的矩阵按时间顺序插入数据帧？之后，我不确定我将如何为每个日期插入深度和温度。也许从以下内容开始：

depth = seq(1, 200, length.out = 100))

最终日期变量将被交换为地理坐标。非常感谢任何建议。

编辑：根据@AndreaM 的要求，我的数据示例：

> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948, 
18948, 18948), class = "Date"), time = structure(c(0.326597222222222, 
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741, 
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5, 
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025, 
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852, 
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557, 
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876, 
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773, 
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993, 
1524.124)), row.names = 100838:100843, class = "data.frame")

Answer 1

一种方法，根据您的情况适当调整：

library(dplyr)
library(lubridate) ## facilitates date-time manipulations

## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
                          value = rnorm(12)) %>%
    arrange(date)

## create vector of -only!- missing dates:
missing_dates <- 
    setdiff(
        seq.Date(from = min(patchy_data$date),
                 to = max(patchy_data$date),
                 by = '1 day'
                 ),
        patchy_data$date
    ) %>% as.Date(origin = '1970-01-01')

## extend initial dataframe with rows per missing date:
full_data <-
    patchy_data %>%
        bind_rows(data.frame(date = missing_dates,
                             value = NA)
                  ) %>%
        arrange(date)

## group by month and impute missing data from monthwise statistic:
full_data %>%
    mutate(month = lubridate::month(date)) %>%
    group_by(month) %>%
    ## coalesce conveniently replaces ifelse-constructs to replace NAs
    mutate(imputed = coalesce(.$value, mean(.$value, na.rm = TRUE)))

编辑使用附加参数（例如测量深度）对生成的数据（缺失日期）进行粒度化的一种可能性是使用 expand.grid，如下所示。假设来自先前代码的对象名称：

## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example

## generate dataframe with missing dates x depths:
missing_dates_and_depths  <- 
    setNames(expand.grid(missing_dates, observation_depths),
             c('date','depthR')
             )


## stack both dataframes as above:
full_data <-
    patchy_data %>%
        bind_rows(missing_dates_and_depths) %>%
        arrange(date)

将缺失的行插入并插入数据框 R

Interpolate and insert missing rows into dataframe R

interpolation

r

time-series

dataframe

sequencing