Filter date time every six hours in R

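I am new to R and I am trying to filter my dataset to avoid autocorrelation. My dataset contains more than 50,000 locations ($longitude and $latitude) from 25 GPS-collared animals, with a datetime stamp ($acquisition_time) and some additional information (age class, sex, study area). I need to filter the locations for each individual ($animals_id) so that only locations with acquisition times at least 6 hours apart are kept. I would start by grouping the data by individual and acquisition_time, but I don't know how to write the filter function.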

Here is a subset of my dataset:

     animals_id    acquisition_time longitude latitude      projection
8663         74 2018-02-17 03:00:24  6.426237 50.31815 EPSG:4326-WGS48
8664         74 2018-02-17 13:00:48  6.428196 50.31657 EPSG:4326-WGS48
8665         74 2018-02-17 18:00:24  6.423940 50.31833 EPSG:4326-WGS48
8666         74 2018-02-18 14:00:24  6.420372 50.31563 EPSG:4326-WGS48
8667         74 2018-02-18 19:00:54  6.420273 50.31534 EPSG:4326-WGS48
8668         74 2018-02-19 00:00:24  6.415756 50.31993 EPSG:4326-WGS48
8669         74 2018-02-19 20:00:24  6.415771 50.31927 EPSG:4326-WGS48
8670         78 2017-05-01 01:00:08  6.337308 50.26133 EPSG:4326-WGS48
8671         78 2017-05-01 06:00:23  6.345836 50.25292 EPSG:4326-WGS48
8672         78 2017-05-01 11:00:41  6.345818 50.25295 EPSG:4326-WGS48
8673         78 2017-05-01 16:00:23  6.345813 50.25287 EPSG:4326-WGS48
8674         78 2017-05-01 21:00:12  6.343215 50.25456 EPSG:4326-WGS48
8675         78 2017-05-02 02:00:23  6.342139 50.25576 EPSG:4326-WGS48
8676         78 2017-05-02 07:00:47  6.352676 50.25308 EPSG:4326-WGS48
     collar_type study_area_id animals_age_class animals_sex
8663         gps            15                 a           f
8664         gps            15                 a           f
8665         gps            15                 a           f
8666         gps            15                 a           f
8667         gps            15                 a           f
8668         gps            15                 a           f
8669         gps            15                 a           f
8670         gps            15                 a           f
8671         gps            15                 a           f
8672         gps            15                 a           f
8673         gps            15                 a           f
8674         gps            15                 a           f
8675         gps            15                 a           f
8676         gps            15                 a           f

My code so far:

data$acquisition_time = as.POSIXct(data$acquisition_time, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")

filtered <- data %>% group_by(animals_id,acquisition_time) %>% filter()

I appreciate any hint.

A quick look at the data and the intervals between acquisition times:

     animals_id    acquisition_time longitude latitude     hours
8663         74 2018-02-17 03:00:24  6.426237 50.31815  0.000000
8664         74 2018-02-17 13:00:48  6.428196 50.31657 10.006667
8665         74 2018-02-17 18:00:24  6.423940 50.31833  4.993333
8666         74 2018-02-18 14:00:24  6.420372 50.31563 20.000000
8667         74 2018-02-18 19:00:54  6.420273 50.31534  5.008333
8668         74 2018-02-19 00:00:24  6.415756 50.31993  4.991667
8669         74 2018-02-19 20:00:24  6.415771 50.31927 20.000000
8670         78 2017-05-01 01:00:08  6.337308 50.26133  0.000000
8671         78 2017-05-01 06:00:23  6.345836 50.25292  5.004167
8672         78 2017-05-01 11:00:41  6.345818 50.25295  5.005000
8673         78 2017-05-01 16:00:23  6.345813 50.25287  4.995000
8674         78 2017-05-01 21:00:12  6.343215 50.25456  4.996944
8675         78 2017-05-02 02:00:23  6.342139 50.25576  5.003056
8676         78 2017-05-02 07:00:47  6.352676 50.25308  5.006667

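To me, this means that for id 74 we'll remove rows 8665 and 8667, and for id 78 we'll remove rows 8671, 8673, and 8675. Doing this will result in all observations being no less than 6 hours apart, by animals_id.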
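The hours column above is the gap to the previous fix of the same animal. Here is a minimal sketch of how such a column might be computed in base R, and of why filtering on that gap alone is not quite the rule described above. The data frame `toy` and its values are made up for illustration and are not taken from the question's data.

```r
# Toy data (hypothetical): animal 1 has fixes every 5 hours, animal 2 has a 7-hour gap.
toy <- data.frame(
  animals_id = c(1L, 1L, 1L, 2L, 2L),
  acquisition_time = as.POSIXct(
    c("2020-01-01 00:00:00", "2020-01-01 05:00:00", "2020-01-01 10:00:00",
      "2020-01-02 00:00:00", "2020-01-02 07:00:00"),
    tz = "UTC"
  )
)

# Gap (in hours) to the previous fix of the same animal; 0 for the first fix.
secs <- as.numeric(toy$acquisition_time)
toy$hours <- ave(secs, toy$animals_id, FUN = function(s) c(0, diff(s)) / 3600)

# Naive rule: keep the first fix and any fix whose consecutive gap is >= 6 hours.
# This drops animal 1's third fix, although it is 10 hours after the first
# (kept) fix, so the gap has to be measured from the last *kept* fix instead.
naive_keep <- toy$hours == 0 | toy$hours >= 6
toy[naive_keep, ]
```

This is why the answer walks through the times one kept fix at a time rather than using a single vectorised comparison on the gaps.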

base R

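func <- function(z, period = 6*3600) {
  # z: times sorted ascending (POSIXct or numeric seconds); period is in seconds.
  # Returns TRUE for the first time and for each later time that is at least
  # `period` after the previously kept one.
  if (length(z) < 2) return(rep(TRUE, length(z)))
  # Work in plain seconds: subtracting POSIXct values gives a difftime whose
  # units are chosen automatically, which would break the comparison below.
  z <- as.numeric(z)
  out <- TRUE
  ind <- 1
  while (ind < length(z)) {
    # offsets (relative to ind) of later times at least `period` after z[ind]
    found <- which( (z[-seq_len(ind)] - z[ind]) >= period )
    if (!length(found)) {
      # nothing far enough away: drop all remaining times
      out <- c(out, rep(FALSE, length(z) - length(out)))
      break
    }
    # drop the times in between, keep the first one that is far enough away
    out <- c(out, rep(FALSE, found[1] - 1), TRUE)
    ind <- ind + found[1]
  }
  out
}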

dat[ave(as.numeric(dat$acquisition_time, units = "sec"), dat$animals_id, FUN = func) > 0,]
#      animals_id    acquisition_time longitude latitude
# 8663         74 2018-02-17 03:00:24  6.426237 50.31815
# 8664         74 2018-02-17 13:00:48  6.428196 50.31657
# 8666         74 2018-02-18 14:00:24  6.420372 50.31563
# 8668         74 2018-02-19 00:00:24  6.415756 50.31993
# 8669         74 2018-02-19 20:00:24  6.415771 50.31927
# 8670         78 2017-05-01 01:00:08  6.337308 50.26133
# 8672         78 2017-05-01 11:00:41  6.345818 50.25295
# 8674         78 2017-05-01 21:00:12  6.343215 50.25456
# 8676         78 2017-05-02 07:00:47  6.352676 50.25308

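(Note: base R's ave has one major limitation: the return value of the provided FUN must be of the same class as the input vector, and this causes problems when the input is POSIXt. To work around that, I convert the times to numeric just for the call to ave. This is not needed for all grouping/summarizing functions in base R, only for ave, though ave is otherwise the best fit for this purpose.)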
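If the class restriction of ave gets in the way, one possible alternative is split() / lapply() / unsplit(), which lets the per-group function return a logical vector directly. The sketch below is only an illustration: `keep_spaced` is a hypothetical helper (a simple loop implementing the same "at least 6 hours since the last kept time" rule), and `toy` is made-up data.

```r
# Hypothetical helper: TRUE for the first time and for each time that is at
# least `period` seconds after the last kept one. Assumes `z` is sorted ascending.
keep_spaced <- function(z, period = 6 * 3600) {
  z <- as.numeric(z)              # POSIXct -> seconds since the epoch
  keep <- logical(length(z))
  last <- -Inf                    # so that the first time is always kept
  for (i in seq_along(z)) {
    if (z[i] - last >= period) {
      keep[i] <- TRUE
      last <- z[i]
    }
  }
  keep
}

# Toy data (hypothetical): hours 0, 5, 10, 15 for animal 1 and 0, 7 for animal 2.
toy <- data.frame(
  animals_id = c(1L, 1L, 1L, 1L, 2L, 2L),
  acquisition_time = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
    3600 * c(0, 5, 10, 15, 0, 7)
)

# split() hands each animal's times to keep_spaced(); unsplit() puts the
# resulting logical vectors back into the original row order.
keep <- unsplit(
  lapply(split(toy$acquisition_time, toy$animals_id), keep_spaced),
  toy$animals_id
)
toy[keep, ]
```

Because the logical vector comes back as-is, there is no need for the `> 0` comparison used with ave.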

dplyr

library(dplyr)
dat %>%
  group_by(animals_id) %>%
  filter(func(acquisition_time)) %>%
  # not necessary, just here to show the resulting hours-between-times
  mutate(hours = c(0, diff(acquisition_time, units = "hours"))) %>%
  ungroup()
# # A tibble: 9 x 5
#   animals_id acquisition_time    longitude latitude hours
#        <int> <dttm>                  <dbl>    <dbl> <dbl>
# 1         74 2018-02-17 03:00:24      6.43     50.3  0   
# 2         74 2018-02-17 13:00:48      6.43     50.3 10.0 
# 3         74 2018-02-18 14:00:24      6.42     50.3 25.0 
# 4         74 2018-02-19 00:00:24      6.42     50.3 10   
# 5         74 2018-02-19 20:00:24      6.42     50.3 20   
# 6         78 2017-05-01 01:00:08      6.34     50.3  0   
# 7         78 2017-05-01 11:00:41      6.35     50.3 10.0 
# 8         78 2017-05-01 21:00:12      6.34     50.3  9.99
# 9         78 2017-05-02 07:00:47      6.35     50.3 10.0 

(Note that dplyr drops the row names. I added the hours column only to show the resulting time differences; it is not needed in production.)
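One caveat: this kind of filter walks through the times in order, so it assumes each animal's rows are already sorted by acquisition time. If that is not guaranteed, sorting first is a cheap safeguard. A minimal sketch, assuming dplyr is installed; `keep_spaced` is a hypothetical stand-in for the filter function and `toy` is made-up data that is deliberately out of order.

```r
library(dplyr)

# Hypothetical stand-in for the filter function: TRUE for the first time and
# for each time at least `period` seconds after the last kept one.
keep_spaced <- function(z, period = 6 * 3600) {
  z <- as.numeric(z)              # POSIXct -> seconds since the epoch
  keep <- logical(length(z))
  last <- -Inf
  for (i in seq_along(z)) {
    if (z[i] - last >= period) {
      keep[i] <- TRUE
      last <- z[i]
    }
  }
  keep
}

# Toy data (hypothetical), rows at hours 10, 0, 5, 15: not in time order.
toy <- data.frame(
  animals_id = c(1L, 1L, 1L, 1L),
  acquisition_time = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
    3600 * c(10, 0, 5, 15)
)

thinned <- toy %>%
  arrange(animals_id, acquisition_time) %>%   # sort within each animal first
  group_by(animals_id) %>%
  filter(keep_spaced(acquisition_time)) %>%
  ungroup()
thinned
```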


Data: for simplicity (MWE), I used only the first four columns of the data above.
dat <- structure(list(animals_id = c(74L, 74L, 74L, 74L, 74L, 74L, 74L, 78L, 78L, 78L, 78L, 78L, 78L, 78L), acquisition_time = structure(c(1518836424, 1518872448, 1518890424, 1518962424, 1518980454, 1518998424, 1519070424, 1493600408, 1493618423, 1493636441, 1493654423, 1493672412, 1493690423, 1493708447), class = c("POSIXct", "POSIXt"), tzone = "UTC"), longitude = c(6.426237, 6.428196, 6.42394, 6.420372, 6.420273, 6.415756, 6.415771, 6.337308, 6.345836, 6.345818, 6.345813, 6.343215, 6.342139, 6.352676), latitude = c(50.31815, 50.31657, 50.31833, 50.31563, 50.31534, 50.31993, 50.31927, 50.26133, 50.25292, 50.25295, 50.25287, 50.25456, 50.25576, 50.25308 )), row.names = c("8663", "8664", "8665", "8666", "8667", "8668", "8669", "8670", "8671", "8672", "8673", "8674", "8675", "8676"), class = "data.frame")