Find and delete gaps in date order within different IDs

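I have a data frame with different IDs and observations on consecutive days. If there are several days in a row without data for one ID, I want to delete that ID.

I used the diff(days) function to show the differences between dates, but I can only do this for one ID at a time.

My df looks like this: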

  ani_id_year       days
1  ID468_2006 2006-04-01
2  ID468_2006 2006-04-02
3  ID468_2006 2006-04-03
4  ID468_2006 2006-04-04
5  ID468_2006 2006-04-05
6  ID599_2006 2006-03-06
7  ID599_2006 2006-03-14
8  ID599_2006 2006-03-15
9  ID599_2006 2006-03-16

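So I can see that there is a gap of 7 days in ID599_2006 (nothing between 2006-03-06 and 2006-03-14), and I would like to remove an ID automatically whenever it has a gap of 7 days or more. Since I have hundreds of IDs, I cannot do this manually.

Maybe you can help me, thanks a lot!

Best, Christian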

If you want to remove all entries of every ID that contains such a gap, here is one way.

library(tidyverse)
df <- structure(list(
  ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006", "ID468_2006",
                  "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006",
                  "ID599_2006"),
  days = c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
           "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
           "2006-03-16")),
  row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
data <- as_tibble(df) %>% 
  mutate(days = as.Date(days))

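data %>% group_by(ani_id_year) %>% 
  # days between each observation and the previous one within the same ID
  mutate(difference = as.numeric(days - lag(days))) %>% 
  # mark the whole ID as "remove" if its largest difference exceeds 7 days
  mutate(to_delete = ifelse(max(difference, na.rm = TRUE) <= 7,
                            "keep", "remove")) %>% 
  filter(to_delete == "keep")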
#> # A tibble: 5 x 4
#> # Groups:   ani_id_year [1]
#>   ani_id_year days       difference to_delete
#>   <chr>       <date>          <dbl> <chr>    
#> 1 ID468_2006  2006-04-01         NA keep     
#> 2 ID468_2006  2006-04-02          1 keep     
#> 3 ID468_2006  2006-04-03          1 keep     
#> 4 ID468_2006  2006-04-04          1 keep     
#> 5 ID468_2006  2006-04-05          1 keep

Created on 2020-08-18 by the reprex package (v0.3.0)
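
Before dropping anything, it can help to look at the largest gap per ID and see which IDs would be affected. The following is only a minimal base R sketch of that check, not part of the answer above: the name max_gap is made up for illustration, and it assumes the dates are already sorted within each ID.

```r
# Example data from the question, with days stored as Date
df <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

# Largest difference (in days) between consecutive dates within each ID.
# An ID with a single observation has no differences, so it gets 0 here.
max_gap <- tapply(as.numeric(df$days), df$ani_id_year, function(x) {
  if (length(x) < 2) 0 else max(diff(x))
})
max_gap

# IDs whose largest difference exceeds 7 days
ids_with_gap <- names(max_gap)[max_gap > 7]
ids_with_gap
```

Changing the threshold in the last step is then a one-number edit, which is handy when the cutoff is still being decided.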

1. base solution

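# count a jump of more than 7 days only when both rows belong to the same ID (rows sorted by ID, then date)
subset(df, !ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7 & ani_id_year[-1] == head(ani_id_year, -1))])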

2. dplyr solution

library(dplyr)
  • Option 1

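    # count a jump of more than 7 days only when both rows belong to the same ID
    df %>%
      filter(!ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7 &
                                               ani_id_year[-1] == head(ani_id_year, -1))])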
    
  • Option 2

    df %>%
      group_by(ani_id_year) %>%
      filter(!any(diff(days) > 7))
    

Output

#   ani_id_year       days
# 1  ID468_2006 2006-04-01
# 2  ID468_2006 2006-04-02
# 3  ID468_2006 2006-04-03
# 4  ID468_2006 2006-04-04
# 5  ID468_2006 2006-04-05

Data

df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006", 
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006", 
"ID599_2006"), days = structure(c(13239, 13240, 13241, 13242, 
13243, 13213, 13221, 13222, 13223), class = "Date")), row.names = c(NA, 
-9L), class = "data.frame")
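
The grouped idea of dplyr option 2 can also be written in base R. This is a minimal sketch, assuming the dates are sorted within each ID; has_gap and result are names made up for illustration. It uses ave() to evaluate the gap check separately for each ID, so differences are never taken across two different IDs.

```r
# Same example data as above, rebuilt so the block runs on its own
df <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

# ave() applies the function to the dates of each ID and spreads the single
# TRUE/FALSE result (stored as 1/0) over all rows of that ID.
has_gap <- ave(as.numeric(df$days), df$ani_id_year,
               FUN = function(x) any(diff(x) > 7)) == 1

# Keep only the rows of IDs without a jump of more than 7 days
result <- df[!has_gap, ]
result
```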

An option with data.table

library(data.table)
setDT(df)[, .SD[!any(diff(days) > 7)], (ani_id_year)]
#  ani_id_year       days
#1:  ID468_2006 2006-04-01
#2:  ID468_2006 2006-04-02
#3:  ID468_2006 2006-04-03
#4:  ID468_2006 2006-04-04
#5:  ID468_2006 2006-04-05

Data

df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006", 
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006", 
"ID599_2006"), days = structure(c(13239, 13240, 13241, 13242, 
13243, 13213, 13221, 13222, 13223), class = "Date")), row.names = c(NA, 
-9L), class = "data.frame")
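
A variant of the same data.table approach returns .SD only for groups that pass the check. Groups for which the if() yields NULL are dropped from the result. This is a sketch under the same assumptions (dates sorted within each ID); dt and kept are names chosen here for illustration, and as.data.table() is used so the original df is not modified by reference.

```r
library(data.table)

# Example data, rebuilt so the block runs on its own
df <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

dt <- as.data.table(df)  # copy, leaves df untouched

# Return the group's rows only if it has no jump of more than 7 days;
# otherwise the expression evaluates to NULL and the group is omitted.
kept <- dt[, if (!any(diff(days) > 7)) .SD, by = ani_id_year]
kept
```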