Find and delete gaps in date order within different IDs

Question

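I have a data frame with different IDs and observations on consecutive days. If an ID has no data for several consecutive days, I want to delete that ID.

I used diff(days) to show the differences between dates, but I can only do this for one ID at a time.

My df looks like this: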

  ani_id_year       days
1  ID468_2006 2006-04-01
2  ID468_2006 2006-04-02
3  ID468_2006 2006-04-03
4  ID468_2006 2006-04-04
5  ID468_2006 2006-04-05
6  ID599_2006 2006-03-06
7  ID599_2006 2006-03-14
8  ID599_2006 2006-03-15
9  ID599_2006 2006-03-16

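So I can see that there is a gap of 7 days in ID599_2006 (no data between 2006-03-06 and 2006-03-14), and I would like to automatically delete such IDs when the gap is 7 days or longer. Since I have hundreds of IDs, I cannot do this by hand.

Maybe you can help me, thanks a lot!

Best, Christian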

Answer 1

If you want to delete all entries of every ID that contains such a gap, this is one way to do it.

library(tidyverse)
df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006", 
                                     "ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006", 
                                     "ID599_2006"), days = c("2006-04-01", "2006-04-02", "2006-04-03", 
                                                             "2006-04-04", "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15", 
                                                             "2006-03-16")), row.names = c(NA, -9L), class = c("tbl_df", "tbl", 
                                                                                                               "data.frame"))
data <- as_tibble(df) %>% 
  mutate(days = as.Date(days))

data %>% group_by(ani_id_year) %>% 
  mutate(difference = as.numeric(days - lag(days))) %>% 
  mutate(to_delete = ifelse(max(difference, na.rm = TRUE) <= 7,
                            "keep", "remove")) %>% 
  filter(to_delete == "keep")
#> # A tibble: 5 x 4
#> # Groups:   ani_id_year [1]
#>   ani_id_year days       difference to_delete
#>   <chr>       <date>          <dbl> <chr>    
#> 1 ID468_2006  2006-04-01         NA keep     
#> 2 ID468_2006  2006-04-02          1 keep     
#> 3 ID468_2006  2006-04-03          1 keep     
#> 4 ID468_2006  2006-04-04          1 keep     
#> 5 ID468_2006  2006-04-05          1 keep

Created on 2020-08-18 by the reprex package (v0.3.0)
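
Before filtering, it can help to look at the largest gap per ID and confirm which IDs would be dropped. The following is a minimal sketch, not part of the original answer: it rebuilds the example data so it runs on its own, the column names `n_days` and `max_gap` are made up for illustration, and it sorts by date first on the assumption that `diff()` and `lag()` only give meaningful gaps when the dates ascend within each ID.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

gaps <- df %>%
  arrange(ani_id_year, days) %>%   # make sure dates ascend within each ID
  group_by(ani_id_year) %>%
  summarise(
    n_days  = n(),
    # largest difference in days between consecutive dates;
    # NA for IDs with a single row, which have no differences
    max_gap = if (n() > 1) max(as.numeric(diff(days))) else NA_real_
  )

gaps
```

For the example data this gives a largest gap of 1 day for ID468_2006 and 8 days for ID599_2006, so only the latter exceeds the threshold of 7.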

Answer 2

1. base solution

subset(df, !ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)])
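
One caveat with any variant that calls `diff(days)` without grouping: it also compares the last row of one ID with the first row of the next, so a jump of more than 7 days at an ID boundary flags the following ID even if its own dates are consecutive. The sketch below is not from the original answer. It uses made-up data (`df2`) in which the ID with the gap comes first, and computes the check per ID with `split()`; the names `df2`, `has_gap`, and `bad_ids` are illustrative.

```r
# Hypothetical data: same structure as df, but the ID with the gap comes first
df2 <- data.frame(
  ani_id_year = c("ID599_2006", "ID599_2006", "ID599_2006",
                  "ID468_2006", "ID468_2006"),
  days = as.Date(c("2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-04-01", "2006-04-02"))
)

# Ungrouped check: the 17-day jump from ID599_2006 to ID468_2006 is counted too
ungrouped <- subset(df2, !ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)])

# Per-ID check: TRUE for every ID with at least one gap longer than 7 days
has_gap <- vapply(split(df2$days, df2$ani_id_year),
                  function(d) any(diff(d) > 7), logical(1))
bad_ids <- names(has_gap)[has_gap]
grouped <- subset(df2, !ani_id_year %in% bad_ids)

nrow(ungrouped)  # 0: both IDs are dropped
nrow(grouped)    # 2: only ID468_2006 remains
```

With the data from the question the two approaches agree, because the date happens to go backwards at the ID boundary there.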

2. dplyr solution

library(dplyr)

Option 1

df %>%
  filter(!ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)])

Option 2

df %>%
  group_by(ani_id_year) %>%
  filter(!any(diff(days) > 7))

Output

#   ani_id_year       days
# 1  ID468_2006 2006-04-01
# 2  ID468_2006 2006-04-02
# 3  ID468_2006 2006-04-03
# 4  ID468_2006 2006-04-04
# 5  ID468_2006 2006-04-05

Data

df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006", 
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006", 
"ID599_2006"), days = structure(c(13239, 13240, 13241, 13242, 
13243, 13213, 13221, 13222, 13223), class = "Date")), row.names = c(NA, 
-9L), class = "data.frame")

Answer 3

An option with data.table

library(data.table)
setDT(df)[, .SD[!any(diff(days) > 7)], (ani_id_year)]
#  ani_id_year       days
#1:  ID468_2006 2006-04-01
#2:  ID468_2006 2006-04-02
#3:  ID468_2006 2006-04-03
#4:  ID468_2006 2006-04-04
#5:  ID468_2006 2006-04-05
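
A commonly used variant of the same idea returns `.SD` only for groups that pass the check, and uses `as.data.table()` so that `df` itself is not converted by reference, which is what `setDT()` does. This is a sketch rather than part of the original answer; `dt` and `res` are illustrative names, and the data are rebuilt so the block runs on its own.

```r
library(data.table)

# Example data from the question
df <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

# Work on a copy; df stays a plain data.frame
dt <- as.data.table(df)

# Keep a group only if none of its consecutive date differences exceeds 7 days;
# groups that fail the check return NULL and are dropped
res <- dt[, if (!any(diff(days) > 7)) .SD, by = ani_id_year]
res
```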

Data

df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006", 
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006", 
"ID599_2006"), days = structure(c(13239, 13240, 13241, 13242, 
13243, 13213, 13221, 13222, 13223), class = "Date")), row.names = c(NA, 
-9L), class = "data.frame")
