将缺失的行插入并插入数据框 R
Interpolate and insert missing rows into dataframe R
我正在处理大量需要大量操作的海洋数据时间序列。
我丢失了几天的数据,想对它们进行插值。具体来说 date/depth/temperature。
这是我的 df 的示例:
> tibble(df)
# A tibble: 351,685 x 9
date time depthR SV temp salinity conduct density calcSV
<date> <times> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-11-17 07:50:18 0.5 1524. 19.7 37.8 51.0 27 1524.
2 2021-11-17 07:50:22 0.5 1524. 19.9 37.6 50.9 26.8 1524.
3 2021-11-17 07:50:23 1.1 1524. 19.9 37.6 50.9 26.8 1524.
4 2021-11-17 07:50:24 1.5 1524. 19.9 37.6 50.9 26.8 1524.
5 2021-11-17 07:50:25 2 1524. 19.9 37.6 50.9 26.8 1524.
每个日期都包含 1000 多行数据,因此我的想法是找到每一天的最大深度,从而为两者之间缺失的日子插入合理的最大深度值。
到目前为止,我已经找到了每个日期的最大深度:
group <- df %>% group_by(date) %>% summarise(max =max(depthR, na.rm=TRUE))
> tibble(group)
# A tibble: 40 x 2
date max
<date> <dbl>
1 2021-11-17 685.
2 2021-11-18 695.
3 2021-11-19 136.
4 2021-11-20 138.
5 2021-11-21 142.
6 2021-11-22 26
7 2021-11-23 136.
8 2021-11-24 297.
9 2021-11-25 613.
10 2021-11-26 81.1
# ... with 30 more rows
然后我设法通过以下方式插入缺失的日期:
> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
date
<date>
1 2021-11-17
2 2021-11-18
3 2021-11-19
4 2021-11-20
5 2021-11-21
6 2021-11-22
7 2021-11-23
8 2021-11-24
9 2021-11-25
10 2021-11-26
# ... with 59 more rows
如您所见,先前的查询已被覆盖。
所以我尝试为插值日期创建一个新的 df 并尝试将它们合并在一起。我收到错误:
> library(stringr)
> group$combined <- str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.
如何在不覆盖原始数据或不冲突的情况下,将这两个不同长度的矩阵按时间顺序插入数据帧?
之后,我不确定我将如何为每个日期插入深度和温度。
也许从以下内容开始:
depth = seq(1, 200, length.out = 100))
最终日期变量将被交换为地理坐标。
非常感谢任何建议。
编辑:根据@AndreaM 的要求,我的数据示例:
> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948,
18948, 18948), class = "Date"), time = structure(c(0.326597222222222,
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741,
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5,
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025,
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852,
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557,
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876,
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773,
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993,
1524.124)), row.names = 100838:100843, class = "data.frame")
一种方法,根据您的情况适当调整:
library(dplyr)
library(lubridate) ## facilitates date-time manipulations
## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
value = rnorm(12)) %>%
arrange(date)
## create vector of -only!- missing dates:
missing_dates <-
setdiff(
seq.Date(from = min(patchy_data$date),
to = max(patchy_data$date),
by = '1 day'
),
patchy_data$date
) %>% as.Date(origin = '1970-01-01')
## extend initial dataframe with rows per missing date:
full_data <-
patchy_data %>%
bind_rows(data.frame(date = missing_dates,
value = NA)
) %>%
arrange(date)
## group by month and impute missing data from monthwise statistic:
full_data %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
## coalesce conveniently replaces ifelse-constructs to replace NAs
mutate(imputed = coalesce(.$value, mean(.$value, na.rm = TRUE)))
编辑
使用附加参数(例如测量深度)对生成的数据(缺失日期)进行粒度化的一种可能性是使用 expand.grid
,如下所示。假设来自先前代码的对象名称:
## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example
## generate dataframe with missing dates x depths:
missing_dates_and_depths <-
setNames(expand.grid(missing_dates, observation_depths),
c('date','depthR')
)
## stack both dataframes as above:
full_data <-
patchy_data %>%
bind_rows(missing_dates_and_depths) %>%
arrange(date)
我正在处理大量需要大量操作的海洋数据时间序列。 我丢失了几天的数据,想对它们进行插值。具体来说 date/depth/temperature。 这是我的 df 的示例:
> tibble(df)
# A tibble: 351,685 x 9
date time depthR SV temp salinity conduct density calcSV
<date> <times> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-11-17 07:50:18 0.5 1524. 19.7 37.8 51.0 27 1524.
2 2021-11-17 07:50:22 0.5 1524. 19.9 37.6 50.9 26.8 1524.
3 2021-11-17 07:50:23 1.1 1524. 19.9 37.6 50.9 26.8 1524.
4 2021-11-17 07:50:24 1.5 1524. 19.9 37.6 50.9 26.8 1524.
5 2021-11-17 07:50:25 2 1524. 19.9 37.6 50.9 26.8 1524.
每个日期都包含 1000 多行数据,因此我的想法是找到每一天的最大深度,从而为两者之间缺失的日子插入合理的最大深度值。 到目前为止,我已经找到了每个日期的最大深度:
group <- df %>% group_by(date) %>% summarise(max =max(depthR, na.rm=TRUE))
> tibble(group)
# A tibble: 40 x 2
date max
<date> <dbl>
1 2021-11-17 685.
2 2021-11-18 695.
3 2021-11-19 136.
4 2021-11-20 138.
5 2021-11-21 142.
6 2021-11-22 26
7 2021-11-23 136.
8 2021-11-24 297.
9 2021-11-25 613.
10 2021-11-26 81.1
# ... with 30 more rows
然后我设法通过以下方式插入缺失的日期:
> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
date
<date>
1 2021-11-17
2 2021-11-18
3 2021-11-19
4 2021-11-20
5 2021-11-21
6 2021-11-22
7 2021-11-23
8 2021-11-24
9 2021-11-25
10 2021-11-26
# ... with 59 more rows
如您所见,先前的查询已被覆盖。 所以我尝试为插值日期创建一个新的 df 并尝试将它们合并在一起。我收到错误:
> library(stringr)
> group$combined <- str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.
如何在不覆盖原始数据或不冲突的情况下,将这两个不同长度的矩阵按时间顺序插入数据帧? 之后,我不确定我将如何为每个日期插入深度和温度。 也许从以下内容开始:
depth = seq(1, 200, length.out = 100))
最终日期变量将被交换为地理坐标。 非常感谢任何建议。
编辑:根据@AndreaM 的要求,我的数据示例:
> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948,
18948, 18948), class = "Date"), time = structure(c(0.326597222222222,
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741,
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5,
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025,
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852,
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557,
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876,
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773,
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993,
1524.124)), row.names = 100838:100843, class = "data.frame")
一种方法,根据您的情况适当调整:
library(dplyr)
library(lubridate) ## facilitates date-time manipulations
## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
value = rnorm(12)) %>%
arrange(date)
## create vector of -only!- missing dates:
missing_dates <-
setdiff(
seq.Date(from = min(patchy_data$date),
to = max(patchy_data$date),
by = '1 day'
),
patchy_data$date
) %>% as.Date(origin = '1970-01-01')
## extend initial dataframe with rows per missing date:
full_data <-
patchy_data %>%
bind_rows(data.frame(date = missing_dates,
value = NA)
) %>%
arrange(date)
## group by month and impute missing data from monthwise statistic:
full_data %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
## coalesce conveniently replaces ifelse-constructs to replace NAs
mutate(imputed = coalesce(.$value, mean(.$value, na.rm = TRUE)))
编辑
使用附加参数(例如测量深度)对生成的数据(缺失日期)进行粒度化的一种可能性是使用 expand.grid
,如下所示。假设来自先前代码的对象名称:
## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example
## generate dataframe with missing dates x depths:
missing_dates_and_depths <-
setNames(expand.grid(missing_dates, observation_depths),
c('date','depthR')
)
## stack both dataframes as above:
full_data <-
patchy_data %>%
bind_rows(missing_dates_and_depths) %>%
arrange(date)