R 函数,用于选择每 hour\day 个均匀分开的 x 个观测值

R function for selecting an x amount of observations per hour\day that are evenly separated

我正在尝试创建一个函数,该函数 select 在定义的时间范围内进行定义数量的观察。

我已经设法创建了一个函数,该函数每小时对一次观察进行子集化:

#create example df 
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <-as.character(c(timestamp,timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))

#function
OnePerHour <- function(df){
  dataOnePerHour <- df %>% 
    group_by(Animal_ID, hour(timestamp), as.Date(timestamp))%>%
    filter(row_number(Animal_ID) == 1)
  return(dataOnePerHour)
}

但是,我无法集中精力扩展,所以我可以 select 更多 obs/hour 均匀分布。

在这个例子中,每分钟都有一次观察,但在“真实数据集”中,可能只有三到四个 observations/hr,一只动物间隔 15 分钟,另一只动物每秒观察一次。因此,假设我正在寻找 3 obs\hr,并且观察频率为 1/分钟,因此 1、21、41 正是我正在寻找的。如果只有三个(相隔 15 分钟),我想包括所有观察结果。

任何帮助将不胜感激。

伊丹

如果我理解正确的话,我可能会做这样的事情。可能有点长,我会加上日期,小时,分钟,先计算每个动物ID与上一个时间点的时差。

然后计算一个小时的观察次数,根据你的描述创建过滤逻辑列。

  df <- df %>% 
  mutate(dt = as.Date(timestamp),
         hr = hour(timestamp),
         m = minute(timestamp)) %>% 
  group_by(Animal_ID) %>% 
  mutate(time_diff = m-lag(m)) 

df <- df %>% 
  group_by(Animal_ID,
           dt,
           hr) %>% 
  mutate(num_in_hour = n(),
         filterlogic = (num_in_hour == 60 & m %in% c(1,21,41))|(num_in_hour %in% c(3,4)&time_diff==15)
  ) %>% 
  filter(filterlogic == TRUE)

这是一个解决方案,它在每小时内为每个 Animal_ID 创建 times_per_hour 等间隔,然后选择该间隔内的第一个观察值。但是,如果该时间间隔内没有任何观测值,则不会选择任何观测值。因此,如果您想要每小时 3 个,并且您在 12:01、12:02 和 12:03 处有观测值,您只会得到第一个,因为 [=] 之间没有观测值23=]-12:40 或 12:40-1:00.

library(dplyr)
library(tidyr)
library(lubridate)

#create example df 
timestamp <- seq(ISOdate(2022,05,20), ISOdate(2022,05,22), "min")
Animal_ID <- c(rep("Avi",length(timestamp)), rep("David",length(timestamp)))
timestamp <-as.character(c(timestamp,timestamp))
df <- as.data.frame(cbind(Animal_ID,timestamp))

get_observations <- function(df, times_per_hour, min_date_time, max_date_time) {
  
  # madke dataframe with all possible minutes between min and max times
  timespan <- expand_grid(Animal_ID = unique(df$Animal_ID),
                          # replace with min and max datetimes of the data
                          timestamp = seq(min_date_time, max_date_time, "min"))
  
  ideal_times <- timespan %>% 
    group_by(Animal_ID, hour = hour(timestamp), date = as.Date(timestamp)) %>% 
    # select the beginning of the interval from which you want an observation
    slice(seq(1, n(), by = 60/times_per_hour)) %>% 
    mutate(time_interval = interval(timestamp, 
                                    lead(timestamp, default = max_date_time))) %>% 
    select(-timestamp)
  
  df %>% 
    mutate(hour = hour(timestamp), date = as.Date(timestamp)) %>% 
    # join so every time interval is matched with all the obs in that hour
    right_join(ideal_times, by = c("Animal_ID", "hour", "date")) %>% 
    # then remove all the obs that aren't in the exact interval
    filter(as_datetime(timestamp) %within% time_interval) %>% 
    group_by(Animal_ID, time_interval) %>% 
    # then take the first observation
    slice(1) %>% 
    ungroup() %>% 
    select(-time_interval)

}

# choose 10% so that observations are not equally spaced
sample_df <- slice_sample(df, prop = .1)
get_observations(sample_df, times_per_hour = 3, 
                 min_date_time = ISOdate(2022,05,20), max_date_time = ISOdate(2022,05,22))

#> # A tibble: 259 × 4
#>    Animal_ID timestamp            hour date      
#>    <chr>     <chr>               <int> <date>    
#>  1 Avi       2022-05-20 12:00:00    12 2022-05-20
#>  2 Avi       2022-05-20 12:32:00    12 2022-05-20
#>  3 Avi       2022-05-20 12:48:00    12 2022-05-20
#>  4 Avi       2022-05-20 13:15:00    13 2022-05-20
#>  5 Avi       2022-05-20 13:35:00    13 2022-05-20
#>  6 Avi       2022-05-20 13:52:00    13 2022-05-20
#>  7 Avi       2022-05-20 14:17:00    14 2022-05-20
#>  8 Avi       2022-05-20 14:28:00    14 2022-05-20
#>  9 Avi       2022-05-20 14:48:00    14 2022-05-20
#> 10 Avi       2022-05-20 15:16:00    15 2022-05-20
#> # … with 249 more rows

reprex package (v2.0.1)

于 2022-05-23 创建