Filter date time every six hours in R
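I am new to R and am trying to filter my dataset to avoid autocorrelation. My dataset contains more than 50,000 locations ($longitude and $latitude) from 25 GPS-collared animals, with a date-time stamp ($acquisition_time) and some additional information (age class, sex, study area). For each individual ($animals_id), I need to filter the locations so that only those whose acquisition times are at least 6 hours apart are kept. I would start by grouping the data by individual and acquisition_time, but I don't know how to write the filter function.
Here is a subset of my dataset: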
animals_id acquisition_time longitude latitude projection
8663 74 2018-02-17 03:00:24 6.426237 50.31815 EPSG:4326-WGS48
8664 74 2018-02-17 13:00:48 6.428196 50.31657 EPSG:4326-WGS48
8665 74 2018-02-17 18:00:24 6.423940 50.31833 EPSG:4326-WGS48
8666 74 2018-02-18 14:00:24 6.420372 50.31563 EPSG:4326-WGS48
8667 74 2018-02-18 19:00:54 6.420273 50.31534 EPSG:4326-WGS48
8668 74 2018-02-19 00:00:24 6.415756 50.31993 EPSG:4326-WGS48
8669 74 2018-02-19 20:00:24 6.415771 50.31927 EPSG:4326-WGS48
8670 78 2017-05-01 01:00:08 6.337308 50.26133 EPSG:4326-WGS48
8671 78 2017-05-01 06:00:23 6.345836 50.25292 EPSG:4326-WGS48
8672 78 2017-05-01 11:00:41 6.345818 50.25295 EPSG:4326-WGS48
8673 78 2017-05-01 16:00:23 6.345813 50.25287 EPSG:4326-WGS48
8674 78 2017-05-01 21:00:12 6.343215 50.25456 EPSG:4326-WGS48
8675 78 2017-05-02 02:00:23 6.342139 50.25576 EPSG:4326-WGS48
8676 78 2017-05-02 07:00:47 6.352676 50.25308 EPSG:4326-WGS48
collar_type study_area_id animals_age_class animals_sex
8663 gps 15 a f
8664 gps 15 a f
8665 gps 15 a f
8666 gps 15 a f
8667 gps 15 a f
8668 gps 15 a f
8669 gps 15 a f
8670 gps 15 a f
8671 gps 15 a f
8672 gps 15 a f
8673 gps 15 a f
8674 gps 15 a f
8675 gps 15 a f
8676 gps 15 a f
My code so far:
data$acquisition_time = as.POSIXct(data$acquisition_time, tz = "UTC", format = "%Y-%m-%d %H:%M:%S")
filtered <- data %>% group_by(animals_id,acquisition_time) %>% filter()
I appreciate every hint.
A quick look at the data and the hours elapsed between consecutive acquisition times:
animals_id acquisition_time longitude latitude hours
8663 74 2018-02-17 03:00:24 6.426237 50.31815 0.000000
8664 74 2018-02-17 13:00:48 6.428196 50.31657 10.006667
8665 74 2018-02-17 18:00:24 6.423940 50.31833 4.993333
8666 74 2018-02-18 14:00:24 6.420372 50.31563 20.000000
8667 74 2018-02-18 19:00:54 6.420273 50.31534 5.008333
8668 74 2018-02-19 00:00:24 6.415756 50.31993 4.991667
8669 74 2018-02-19 20:00:24 6.415771 50.31927 20.000000
8670 78 2017-05-01 01:00:08 6.337308 50.26133 0.000000
8671 78 2017-05-01 06:00:23 6.345836 50.25292 5.004167
8672 78 2017-05-01 11:00:41 6.345818 50.25295 5.005000
8673 78 2017-05-01 16:00:23 6.345813 50.25287 4.995000
8674 78 2017-05-01 21:00:12 6.343215 50.25456 4.996944
8675 78 2017-05-02 02:00:23 6.342139 50.25576 5.003056
8676 78 2017-05-02 07:00:47 6.352676 50.25308 5.006667
To me, this means that for id 74 we would remove rows 8665 and 8667, and for id 78 we would remove rows 8671, 8673, and 8675. Doing so leaves all observations at least 6 hours apart within each animals_id.
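The hours column in the table above is the time elapsed since the previous fix of the same animal. A minimal sketch of how such a column could be computed in base R, assuming the rows are already sorted by animal and time; the toy data frame toy and the helper gap_hours are made up for illustration and are not part of the original post:

```r
# Toy data (hypothetical), already sorted by animal and time
toy <- data.frame(
  animals_id = c(74L, 74L, 74L, 78L, 78L),
  acquisition_time = as.POSIXct(
    c("2018-02-17 03:00:24", "2018-02-17 13:00:48", "2018-02-17 18:00:24",
      "2017-05-01 01:00:08", "2017-05-01 06:00:23"),
    tz = "UTC"
  )
)

# Hours since the previous fix; 0 for the first fix in the vector
gap_hours <- function(secs) c(0, diff(secs)) / 3600

# ave() applies gap_hours() within each animal and keeps the row order
toy$hours <- ave(as.numeric(toy$acquisition_time), toy$animals_id, FUN = gap_hours)
toy
```

Converting to numeric seconds first sidesteps the automatically chosen units of a difftime.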
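base R
func <- function(z, period = 6*3600) {
  # Work in plain seconds: subtracting POSIXct values yields a difftime whose
  # units are picked automatically, which may not compare correctly with `period`.
  z <- as.numeric(z)
  if (length(z) < 2) return(rep(TRUE, length(z)))
  out <- TRUE   # always keep the first observation
  ind <- 1      # index of the most recently kept observation
  while (ind < length(z)) {
    # later observations at least `period` seconds after the kept one
    found <- which( (z[-seq_len(ind)] - z[ind]) >= period )
    if (!length(found)) {
      # nothing is far enough away: drop everything that remains
      out <- c(out, rep(FALSE, length(z) - length(out)))
      break
    }
    # drop the observations in between, keep the first one far enough away
    out <- c(out, rep(FALSE, found[1] - 1), TRUE)
    ind <- ind + found[1]
  }
  out
}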
dat[ave(as.numeric(dat$acquisition_time, units = "sec"), dat$animals_id, FUN = func) > 0,]
# animals_id acquisition_time longitude latitude
# 8663 74 2018-02-17 03:00:24 6.426237 50.31815
# 8664 74 2018-02-17 13:00:48 6.428196 50.31657
# 8666 74 2018-02-18 14:00:24 6.420372 50.31563
# 8668 74 2018-02-19 00:00:24 6.415756 50.31993
# 8669 74 2018-02-19 20:00:24 6.415771 50.31927
# 8670 78 2017-05-01 01:00:08 6.337308 50.26133
# 8672 78 2017-05-01 11:00:41 6.345818 50.25295
# 8674 78 2017-05-01 21:00:12 6.343215 50.25456
# 8676 78 2017-05-02 07:00:47 6.352676 50.25308
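(Note: base R's ave has a major limitation: the return value of the provided FUN function must be the same class as the input vector, which causes some problems when the input is POSIXt. To mitigate them, I preemptively convert the times to numeric for the call to ave. This is not needed for all group-summarizing functions in base R, just ave, though ave is the best fit for this purpose.)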
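To make the note about ave concrete, here is a small sketch of the coercion it describes. The vectors x and g and the helper flag_first are hypothetical and exist only for this demonstration:

```r
x <- c(10, 20, 30, 40)          # numeric input, like the converted times
g <- c("a", "a", "b", "b")      # grouping variable

# FUN returns a logical vector, but ave() writes it back into the numeric x
flag_first <- function(v) seq_along(v) == 1
res <- ave(x, g, FUN = flag_first)

class(res)        # "numeric", not "logical"
keep <- res > 0   # compare with 0 to recover a logical index
```

This is why the base R call above ends in > 0: the TRUE/FALSE values come back as 1/0.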
dplyr
library(dplyr)
dat %>%
  group_by(animals_id) %>%
  filter(func(acquisition_time)) %>%
  # not necessary, just here to show the resulting hours-between-times
  mutate(hours = c(0, diff(acquisition_time, units = "hours"))) %>%
  ungroup()
# # A tibble: 9 x 5
# animals_id acquisition_time longitude latitude hours
# <int> <dttm> <dbl> <dbl> <dbl>
# 1 74 2018-02-17 03:00:24 6.43 50.3 0
# 2 74 2018-02-17 13:00:48 6.43 50.3 10.0
# 3 74 2018-02-18 14:00:24 6.42 50.3 25.0
# 4 74 2018-02-19 00:00:24 6.42 50.3 10
# 5 74 2018-02-19 20:00:24 6.42 50.3 20
# 6 78 2017-05-01 01:00:08 6.34 50.3 0
# 7 78 2017-05-01 11:00:41 6.35 50.3 10.0
# 8 78 2017-05-01 21:00:12 6.34 50.3 9.99
# 9 78 2017-05-02 07:00:47 6.35 50.3 10.0
(Note that dplyr drops the row names. I added the hours column only to demonstrate the resulting time differences; it is not needed in production.)
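Both variants appear to rely on the times being in ascending order within each animal, since func always measures forward from the last kept observation. A minimal base R sketch of sorting and checking that assumption before filtering; the toy data frame is made up for illustration:

```r
# Toy data (hypothetical), deliberately out of order
toy <- data.frame(
  animals_id = c(78L, 74L, 74L, 78L, 74L),
  acquisition_time = as.POSIXct(
    c("2017-05-01 06:00:23", "2018-02-17 13:00:48", "2018-02-17 03:00:24",
      "2017-05-01 01:00:08", "2018-02-17 18:00:24"),
    tz = "UTC"
  )
)

# Sort by animal, then by time within each animal
toy <- toy[order(toy$animals_id, toy$acquisition_time), ]

# TRUE if any animal still has out-of-order times
still_unsorted <- any(
  tapply(as.numeric(toy$acquisition_time), toy$animals_id, is.unsorted)
)
still_unsorted
```

In a dplyr pipeline, arrange(animals_id, acquisition_time) before group_by() would play the same role.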
Data: for simplicity/MWE, I only used the first four columns of the data above.
dat <- structure(list(animals_id = c(74L, 74L, 74L, 74L, 74L, 74L, 74L, 78L, 78L, 78L, 78L, 78L, 78L, 78L), acquisition_time = structure(c(1518836424, 1518872448, 1518890424, 1518962424, 1518980454, 1518998424, 1519070424, 1493600408, 1493618423, 1493636441, 1493654423, 1493672412, 1493690423, 1493708447), class = c("POSIXct", "POSIXt"), tzone = "UTC"), longitude = c(6.426237, 6.428196, 6.42394, 6.420372, 6.420273, 6.415756, 6.415771, 6.337308, 6.345836, 6.345818, 6.345813, 6.343215, 6.342139, 6.352676), latitude = c(50.31815, 50.31657, 50.31833, 50.31563, 50.31534, 50.31993, 50.31927, 50.26133, 50.25292, 50.25295, 50.25287, 50.25456, 50.25576, 50.25308 )), row.names = c("8663", "8664", "8665", "8666", "8667", "8668", "8669", "8670", "8671", "8672", "8673", "8674", "8675", "8676"), class = "data.frame")