R 函数,用于选择每 hour\day 个均匀分开的 x 个观测值
R function for selecting an x amount of observations per hour\day that are evenly separated
我正在尝试创建一个函数,该函数 select 在定义的时间范围内进行定义数量的观察。
我已经设法创建了一个函数,该函数每小时对一次观察进行子集化:
#create example df
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <-as.character(c(timestamp,timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))
#function
OnePerHour <- function(df){
dataOnePerHour <- df %>%
group_by(Animal_ID, hour(timestamp), as.Date(timestamp))%>%
filter(row_number(Animal_ID) == 1)
return(dataOnePerHour)
}
但是,我无法集中精力扩展,所以我可以 select 更多 obs/hour 均匀分布。
在这个例子中,每分钟都有一次观察,但在“真实数据集”中,可能只有三到四个 observations/hr,一只动物间隔 15 分钟,另一只动物每秒观察一次。因此,假设我正在寻找 3 obs\hr,并且观察频率为 1/分钟,因此 1、21、41 正是我正在寻找的。如果只有三个(相隔 15 分钟),我想包括所有观察结果。
任何帮助将不胜感激。
伊丹
如果我理解正确的话,我可能会做这样的事情。可能有点长,我会加上日期,小时,分钟,先计算每个动物ID与上一个时间点的时差。
然后计算一个小时的观察次数,根据你的描述创建过滤逻辑列。
df <- df %>%
mutate(dt = as.Date(timestamp),
hr = hour(timestamp),
m = minute(timestamp)) %>%
group_by(Animal_ID) %>%
mutate(time_diff = m-lag(m))
df <- df %>%
group_by(Animal_ID,
dt,
hr) %>%
mutate(num_in_hour = n(),
filterlogic = (num_in_hour == 60 & m %in% c(1,21,41))|(num_in_hour %in% c(3,4)&time_diff==15)
) %>%
filter(filterlogic == TRUE)
这是一个解决方案,它在每小时内为每个 Animal_ID
创建 times_per_hour
等间隔,然后选择该间隔内的第一个观察值。但是,如果该时间间隔内没有任何观测值,则不会选择任何观测值。因此,如果您想要每小时 3 个,并且您在 12:01、12:02 和 12:03 处有观测值,您只会得到第一个,因为 [=] 之间没有观测值23=]-12:40 或 12:40-1:00.
library(dplyr)
library(tidyr)
library(lubridate)
#create example df
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <-as.character(c(timestamp,timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))
get_observations <- function(df, times_per_hour, min_date_time, max_date_time) {
# madke dataframe with all possible minutes between min and max times
timespan <- expand_grid(Animal_ID = unique(df$Animal_ID),
# replace with min and max datetimes of the data
timestamp = seq(min_date_time, max_date_time, "min"))
ideal_times <- timespan %>%
group_by(Animal_ID, hour = hour(timestamp), date = as.Date(timestamp)) %>%
# select the beginning of the interval from which you want an observation
slice(seq(1, n(), by = 60/times_per_hour)) %>%
mutate(time_interval = interval(timestamp,
lead(timestamp, default = max_date_time))) %>%
select(-timestamp)
df %>%
mutate(hour = hour(timestamp), date = as.Date(timestamp)) %>%
# join so every time interval is matched with all the obs in that hour
right_join(ideal_times, by = c("Animal_ID", "hour", "date")) %>%
# then remove all the obs that aren't in the exact interval
filter(as_datetime(timestamp) %within% time_interval) %>%
group_by(Animal_ID, time_interval) %>%
# then take the first observation
slice(1) %>%
ungroup() %>%
select(-time_interval)
}
# choose 10% so that observations are not equally spaced
sample_df <- slice_sample(df, prop = .1)
get_observations(sample_df, times_per_hour = 3,
min_date_time = ISOdate(2022,05,20), max_date_time = ISOdate(2022,05,22))
#> # A tibble: 259 × 4
#> Animal_ID timestamp hour date
#> <chr> <chr> <int> <date>
#> 1 Avi 2022-05-20 12:00:00 12 2022-05-20
#> 2 Avi 2022-05-20 12:32:00 12 2022-05-20
#> 3 Avi 2022-05-20 12:48:00 12 2022-05-20
#> 4 Avi 2022-05-20 13:15:00 13 2022-05-20
#> 5 Avi 2022-05-20 13:35:00 13 2022-05-20
#> 6 Avi 2022-05-20 13:52:00 13 2022-05-20
#> 7 Avi 2022-05-20 14:17:00 14 2022-05-20
#> 8 Avi 2022-05-20 14:28:00 14 2022-05-20
#> 9 Avi 2022-05-20 14:48:00 14 2022-05-20
#> 10 Avi 2022-05-20 15:16:00 15 2022-05-20
#> # … with 249 more rows
由 reprex package (v2.0.1)
于 2022-05-23 创建
我正在尝试创建一个函数,该函数 select 在定义的时间范围内进行定义数量的观察。
我已经设法创建了一个函数,该函数每小时对一次观察进行子集化:
#create example df
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <-as.character(c(timestamp,timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))
#function
OnePerHour <- function(df){
dataOnePerHour <- df %>%
group_by(Animal_ID, hour(timestamp), as.Date(timestamp))%>%
filter(row_number(Animal_ID) == 1)
return(dataOnePerHour)
}
但是,我无法集中精力扩展,所以我可以 select 更多 obs/hour 均匀分布。
在这个例子中,每分钟都有一次观察,但在“真实数据集”中,可能只有三到四个 observations/hr,一只动物间隔 15 分钟,另一只动物每秒观察一次。因此,假设我正在寻找 3 obs\hr,并且观察频率为 1/分钟,因此 1、21、41 正是我正在寻找的。如果只有三个(相隔 15 分钟),我想包括所有观察结果。
任何帮助将不胜感激。
伊丹
如果我理解正确的话,我可能会做这样的事情。可能有点长,我会加上日期,小时,分钟,先计算每个动物ID与上一个时间点的时差。
然后计算一个小时的观察次数,根据你的描述创建过滤逻辑列。
df <- df %>%
mutate(dt = as.Date(timestamp),
hr = hour(timestamp),
m = minute(timestamp)) %>%
group_by(Animal_ID) %>%
mutate(time_diff = m-lag(m))
df <- df %>%
group_by(Animal_ID,
dt,
hr) %>%
mutate(num_in_hour = n(),
filterlogic = (num_in_hour == 60 & m %in% c(1,21,41))|(num_in_hour %in% c(3,4)&time_diff==15)
) %>%
filter(filterlogic == TRUE)
这是一个解决方案,它在每小时内为每个 Animal_ID
创建 times_per_hour
等间隔,然后选择该间隔内的第一个观察值。但是,如果该时间间隔内没有任何观测值,则不会选择任何观测值。因此,如果您想要每小时 3 个,并且您在 12:01、12:02 和 12:03 处有观测值,您只会得到第一个,因为 [=] 之间没有观测值23=]-12:40 或 12:40-1:00.
library(dplyr)
library(tidyr)
library(lubridate)
#create example df
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <-as.character(c(timestamp,timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))
get_observations <- function(df, times_per_hour, min_date_time, max_date_time) {
# madke dataframe with all possible minutes between min and max times
timespan <- expand_grid(Animal_ID = unique(df$Animal_ID),
# replace with min and max datetimes of the data
timestamp = seq(min_date_time, max_date_time, "min"))
ideal_times <- timespan %>%
group_by(Animal_ID, hour = hour(timestamp), date = as.Date(timestamp)) %>%
# select the beginning of the interval from which you want an observation
slice(seq(1, n(), by = 60/times_per_hour)) %>%
mutate(time_interval = interval(timestamp,
lead(timestamp, default = max_date_time))) %>%
select(-timestamp)
df %>%
mutate(hour = hour(timestamp), date = as.Date(timestamp)) %>%
# join so every time interval is matched with all the obs in that hour
right_join(ideal_times, by = c("Animal_ID", "hour", "date")) %>%
# then remove all the obs that aren't in the exact interval
filter(as_datetime(timestamp) %within% time_interval) %>%
group_by(Animal_ID, time_interval) %>%
# then take the first observation
slice(1) %>%
ungroup() %>%
select(-time_interval)
}
# choose 10% so that observations are not equally spaced
sample_df <- slice_sample(df, prop = .1)
get_observations(sample_df, times_per_hour = 3,
min_date_time = ISOdate(2022,05,20), max_date_time = ISOdate(2022,05,22))
#> # A tibble: 259 × 4
#> Animal_ID timestamp hour date
#> <chr> <chr> <int> <date>
#> 1 Avi 2022-05-20 12:00:00 12 2022-05-20
#> 2 Avi 2022-05-20 12:32:00 12 2022-05-20
#> 3 Avi 2022-05-20 12:48:00 12 2022-05-20
#> 4 Avi 2022-05-20 13:15:00 13 2022-05-20
#> 5 Avi 2022-05-20 13:35:00 13 2022-05-20
#> 6 Avi 2022-05-20 13:52:00 13 2022-05-20
#> 7 Avi 2022-05-20 14:17:00 14 2022-05-20
#> 8 Avi 2022-05-20 14:28:00 14 2022-05-20
#> 9 Avi 2022-05-20 14:48:00 14 2022-05-20
#> 10 Avi 2022-05-20 15:16:00 15 2022-05-20
#> # … with 249 more rows
由 reprex package (v2.0.1)
于 2022-05-23 创建