Remove NAs if gap is greater than certain time interval or certain number of rows

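I have a data frame of locations in which some positions are NA at particular datetimes. I want to estimate the positions for these NA values, but when there are more than 3 consecutive NA values (a gap of more than 3 hours), I want to remove those rows from the dataset (i.e. I don't want to estimate positions for gaps longer than 3 rows/3 hours of NAs).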

Here is an example of my data:

table <- "id   date    time   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4
15 B     2012-11-09 00:00:00  NA    NA  
16 B     2012-11-09 01:00:00  NA    NA  
17 B     2012-11-09 02:00:00  NA    NA  
18 B     2012-11-09 03:00:00  NA    NA  
19 B     2012-11-09 04:00:00  NA    NA  
20 B     2012-11-09 05:00:00  36.6 -53.5"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df

library(tidyverse)

# Combine date and time into a single POSIXct datetime column
df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))

Here is an example of the desired output (note that rows 15-19 have now been removed, because they form a gap of 5 NA values/5 hours):

table <- "id        datetime   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4 
15 B     2012-11-09 05:00:00  36.6 -53.5"

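Short of selecting specific rows by hand (which I can't do because this dataset is large), I can't figure out how to tell R to keep NAs only when they occur in runs of 3 or fewer (3 hours or less). Any help would be much appreciated!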

df %>%
  group_by(grp1 = cumsum(!is.na(lat) & !is.na(lon)), grp2 = !is.na(lat) & !is.na(lon)) %>%
  filter((!is.na(lat) & !is.na(lon)) | n() <= 3) %>%
  ungroup()
# # A tibble: 15 x 6
#    id    datetime              lat   lon  grp1 grp2 
#    <chr> <dttm>              <dbl> <dbl> <int> <lgl>
#  1 A     2011-10-03 05:00:00  35   -53.4     1 TRUE 
#  2 A     2011-10-03 06:00:00  35.1 -53.4     2 TRUE 
#  3 A     2011-10-03 07:00:00  NA    NA       2 FALSE
#  4 A     2011-10-03 08:00:00  NA    NA       2 FALSE
#  5 A     2011-10-03 09:00:00  35.1 -53.4     3 TRUE 
#  6 A     2011-10-03 10:00:00  36.2 -53.6     4 TRUE 
#  7 A     2011-10-03 23:00:00  36.6 -53.6     5 TRUE 
#  8 B     2012-11-08 05:00:00  35.8 -53.4     6 TRUE 
#  9 B     2012-11-08 06:00:00  NA    NA       6 FALSE
# 10 B     2012-11-08 07:00:00  36   -53.4     7 TRUE 
# 11 B     2012-11-08 08:00:00  NA    NA       7 FALSE
# 12 B     2012-11-08 09:00:00  NA    NA       7 FALSE
# 13 B     2012-11-08 10:00:00  36.5 -53.4     8 TRUE 
# 14 B     2012-11-08 23:00:00  36.6 -53.4     9 TRUE 
# 15 B     2012-11-09 05:00:00  36.6 -53.5    10 TRUE 

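This creates two (temporary) groups: the first increments whenever we hit a non-NA row (lat/lon), and the second further subsets it so that NA rows and non-NA rows are looked at separately. Each run of consecutive NAs therefore ends up in its own group, and n() is the length of that run.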
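To make the two grouping keys easier to see, here is a minimal base R sketch on a toy vector. The values and the helper names (`ok`, `n_in_group`, `keep`) are made up for illustration and are not part of the answer's pipeline.

```r
# Toy latitude vector (made-up values): a run of 2 NAs and a run of 4 NAs
lat <- c(35.0, NA, NA, 35.1, NA, NA, NA, NA, 36.0)

ok   <- !is.na(lat)   # plays the role of grp2
grp1 <- cumsum(ok)    # increments on every non-NA row

# Size of the (grp1, grp2) group each row belongs to, like n() after group_by()
n_in_group <- ave(seq_along(lat), paste(grp1, ok), FUN = length)

# Keep every valid row, and NA rows whose run has 3 rows or fewer
keep <- ok | n_in_group <= 3

data.frame(lat, grp1, grp2 = ok, n = n_in_group, keep)
```

Only the run of four NAs is flagged for removal; the run of two is kept.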

An alternative that creates just one new grouping column:

df %>%
  mutate(tmpdttm = if_else(!is.na(lat) & !is.na(lon), datetime, datetime[NA])) %>%
  tidyr::fill(tmpdttm) %>%
  group_by(tmpdttm) %>%
  # each group holds one valid fix plus the NAs that follow it, so count only the NAs
  filter(!is.na(lat) | sum(is.na(lat)) <= 3) %>%
  ungroup()
# # A tibble: 15 x 5
#    id    datetime              lat   lon tmpdttm            
#    <chr> <dttm>              <dbl> <dbl> <dttm>             
#  1 A     2011-10-03 05:00:00  35   -53.4 2011-10-03 05:00:00
#  2 A     2011-10-03 06:00:00  35.1 -53.4 2011-10-03 06:00:00
#  3 A     2011-10-03 07:00:00  NA    NA   2011-10-03 06:00:00
#  4 A     2011-10-03 08:00:00  NA    NA   2011-10-03 06:00:00
#  5 A     2011-10-03 09:00:00  35.1 -53.4 2011-10-03 09:00:00
#  6 A     2011-10-03 10:00:00  36.2 -53.6 2011-10-03 10:00:00
#  7 A     2011-10-03 23:00:00  36.6 -53.6 2011-10-03 23:00:00
#  8 B     2012-11-08 05:00:00  35.8 -53.4 2012-11-08 05:00:00
#  9 B     2012-11-08 06:00:00  NA    NA   2012-11-08 05:00:00
# 10 B     2012-11-08 07:00:00  36   -53.4 2012-11-08 07:00:00
# 11 B     2012-11-08 08:00:00  NA    NA   2012-11-08 07:00:00
# 12 B     2012-11-08 09:00:00  NA    NA   2012-11-08 07:00:00
# 13 B     2012-11-08 10:00:00  36.5 -53.4 2012-11-08 10:00:00
# 14 B     2012-11-08 23:00:00  36.6 -53.4 2012-11-08 23:00:00
# 15 B     2012-11-09 05:00:00  36.6 -53.5 2012-11-09 05:00:00
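Both pipelines above count rows, which matches elapsed time only when positions are logged at a regular hourly interval. For irregular sampling, a rough sketch that measures each gap in time might look like the following. It assumes a "gap" means the time between the last valid fix before an NA run and the first valid fix after it; the toy data, the `max_gap_hours` threshold and the helper columns (`secs`, `prev_fix`, `next_fix`, `gap_hours`) are all hypothetical.

```r
library(dplyr)

# Made-up track with irregular sampling (offsets in hours from the first fix)
pos <- data.frame(
  id = "A",
  datetime = as.POSIXct("2011-10-03 05:00:00", tz = "UTC") +
    3600 * c(0, 1, 1.5, 2, 3, 4, 5, 6, 7, 8),
  lat = c(35.0, NA, NA, NA, 35.2, NA, NA, NA, NA, 36.0)
)

max_gap_hours <- 4  # hypothetical threshold; adjust to your own definition

out <- pos %>%
  group_by(id) %>%
  arrange(datetime, .by_group = TRUE) %>%
  mutate(
    secs      = as.numeric(datetime),
    # time of the most recent valid fix at or before each row
    prev_fix  = cummax(ifelse(is.na(lat), -Inf, secs)),
    # time of the next valid fix at or after each row
    next_fix  = rev(cummin(rev(ifelse(is.na(lat), Inf, secs)))),
    gap_hours = (next_fix - prev_fix) / 3600
  ) %>%
  # keep valid fixes, and NAs whose bracketing fixes are close enough in time
  filter(!is.na(lat) | gap_hours <= max_gap_hours) %>%
  ungroup() %>%
  select(id, datetime, lat)

out
```

Under this definition, NA runs at the very start or end of a track have an infinite gap and are dropped, which may or may not be what you want.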

I use tidyverse and split mine into two steps:
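# Step 1: within each id, give every run of consecutive NAs one shared
# group number (gn); every non-NA row gets a group number of its own
df1 <- df %>% 
       group_by(id) %>% 
       mutate(gn = cumsum(!(is.na(lat) & is.na(lag(lat, default = 0))))) %>% 
       ungroup()
# Step 2: count the rows per group, keep groups of 3 rows or fewer,
# and join back to recover the original columns
df1 %>% 
       group_by(id, gn) %>% 
       summarise(count = n()) %>% ungroup() %>% 
       filter(count <= 3) %>% 
       inner_join(df1, by = c('id','gn'))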

Here is a tidyverse solution that uses rleid from data.table:
library(data.table)
library(tidyverse)

df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  # rleid() gives every run of consecutive NA (or non-NA) rows its own id
  group_by(new = rleid(is.na(lat))) %>% 
  # keep every valid fix, and NA runs of 3 rows or fewer
  filter(!is.na(lat) | n() <= 3) %>% 
  ungroup() %>% 
  select(-new)

This gives us:

# A tibble: 15 x 4
   id    datetime              lat   lon
   <chr> <dttm>              <dbl> <dbl>
 1 A     2011-10-03 05:00:00  35   -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36   -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4
15 B     2012-11-09 05:00:00  36.6 -53.5
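For readers who would rather not load data.table, the run ids that rleid produces can be reproduced with base R's rle(). This is only a sketch on a made-up vector; `run_id`, `run_n` and `keep` are illustrative names.

```r
# Toy latitude vector (made-up values): a run of 2 NAs and a run of 4 NAs
lat <- c(35.0, 35.1, NA, NA, 35.1, NA, NA, NA, NA, 36.0)

runs <- rle(is.na(lat))

# One id per run of consecutive equal values, as data.table::rleid(is.na(lat)) would give
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Length of the run each row belongs to
run_n <- rep(runs$lengths, runs$lengths)

# Keep every valid row, and NA rows whose run has 3 rows or fewer
keep <- !is.na(lat) | run_n <= 3

lat[keep]
```

The run of two NAs survives and the run of four is dropped, mirroring the row-count rule used in the pipelines above.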