Find and delete gaps in date order within different IDs
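I have a data frame with different IDs and observations on consecutive days. If an ID has several consecutive days without data, I want to delete that ID's observations.
I used the diff(days) function to show the differences between dates, but I can only do this for one ID.
My df looks like this: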
ani_id_year days
1 ID468_2006 2006-04-01
2 ID468_2006 2006-04-02
3 ID468_2006 2006-04-03
4 ID468_2006 2006-04-04
5 ID468_2006 2006-04-05
6 ID599_2006 2006-03-06
7 ID599_2006 2006-03-14
8 ID599_2006 2006-03-15
9 ID599_2006 2006-03-16
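So I can see that ID599_2006 has a gap of 7 missing days (2006-03-06 to 2006-03-14), and I want to automatically delete an ID when its gap is >= 7 days. Since I have hundreds of IDs, I cannot do this by hand.
Maybe you can help me, thank you very much!
Best, Christian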
If you want to remove all entries for each ID that contains such a gap, here is one way.
library(tidyverse)
df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006",
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006",
"ID599_2006"), days = c("2006-04-01", "2006-04-02", "2006-04-03",
"2006-04-04", "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
"2006-03-16")), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
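# Convert the character column to Date so that date arithmetic works
data <- as_tibble(df) %>%
  mutate(days = as.Date(days))
# Per ID: difference in days to the previous row; keep only IDs whose largest difference is <= 7
data %>% group_by(ani_id_year) %>%
  mutate(difference = as.numeric(days - lag(days))) %>%
  mutate(to_delete = ifelse(max(difference, na.rm = TRUE) <= 7,
                            "keep", "remove")) %>%
  filter(to_delete == "keep")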
#> # A tibble: 5 x 4
#> # Groups: ani_id_year [1]
#> ani_id_year days difference to_delete
#> <chr> <date> <dbl> <chr>
#> 1 ID468_2006 2006-04-01 NA keep
#> 2 ID468_2006 2006-04-02 1 keep
#> 3 ID468_2006 2006-04-03 1 keep
#> 4 ID468_2006 2006-04-04 1 keep
#> 5 ID468_2006 2006-04-05 1 keep
Created on 2020-08-18 by the reprex package (v0.3.0)
1. base solution
subset(df, !ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)])
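One caveat with the ungrouped diff(days): it also compares the last date of one ID with the first date of the next ID, so a large jump at an ID boundary can flag the wrong ID. Below is a minimal sketch of a grouped base R variant using ave(). The names df2, max_gap, res_ungrouped and res_base are made up for illustration, and the rows are deliberately ordered so that ID599_2006 comes first.

```r
# Illustrative data: same observations as in the question, but with
# ID599_2006 placed before ID468_2006 (an assumption for this sketch).
df2 <- data.frame(
  ani_id_year = rep(c("ID599_2006", "ID468_2006"), times = c(4, 5)),
  days = as.Date(c("2006-03-06", "2006-03-14", "2006-03-15", "2006-03-16",
                   "2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05"))
)

# Ungrouped version: the 16-day jump between the two IDs is treated as a gap,
# so ID468_2006 is flagged as well and both IDs are dropped.
res_ungrouped <- subset(
  df2, !ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)]
)

# Grouped version: largest day-to-day difference within each ID
# (0 for IDs that have only a single row).
max_gap <- ave(as.numeric(df2$days), df2$ani_id_year,
               FUN = function(x) max(c(0, diff(x))))

# Keep only IDs whose largest within-ID difference is at most 7 days.
res_base <- df2[max_gap <= 7, ]
```

With the row order from the question the ungrouped call happens to work, because the jump from ID468_2006 to ID599_2006 is negative. The grouped version does not depend on that.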
2. dplyr solution
library(dplyr)
Option 1
df %>%
  filter(!ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)])
Option 2
df %>%
  group_by(ani_id_year) %>%
  filter(!any(diff(days) > 7))
Output
# ani_id_year days
# 1 ID468_2006 2006-04-01
# 2 ID468_2006 2006-04-02
# 3 ID468_2006 2006-04-03
# 4 ID468_2006 2006-04-04
# 5 ID468_2006 2006-04-05
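Before dropping anything across hundreds of IDs, it may help to look at the largest gap per ID first. The following is a sketch of that check, not part of the answer above. The names df2, gaps, n_obs and max_gap are hypothetical, and the rows are sorted within each ID on the assumption that the real data may not be ordered by date.

```r
library(dplyr)

# Illustrative copy of the example data (assumed names).
df2 <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

# One row per ID: number of observations and largest difference (in days)
# between consecutive dates; c(0, ...) covers IDs with a single row.
gaps <- df2 %>%
  group_by(ani_id_year) %>%
  arrange(days, .by_group = TRUE) %>%
  summarise(n_obs = n(),
            max_gap = max(c(0, as.numeric(diff(days)))))
```

IDs with max_gap > 7 in this summary are the ones that Option 2 removes.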
Data
df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006",
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006",
"ID599_2006"), days = structure(c(13239, 13240, 13241, 13242,
13243, 13213, 13221, 13222, 13223), class = "Date")), row.names = c(NA,
-9L), class = "data.frame")
An option with data.table
library(data.table)
setDT(df)[, .SD[!any(diff(days) > 7)], (ani_id_year)]
# ani_id_year days
#1: ID468_2006 2006-04-01
#2: ID468_2006 2006-04-02
#3: ID468_2006 2006-04-03
#4: ID468_2006 2006-04-04
#5: ID468_2006 2006-04-05
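diff(days) only measures real gaps when the dates are in ascending order within each ID. As a hedged variant of the call above, the sketch below sorts first and then returns .SD only for groups without a gap, which is a common data.table idiom. The names dt and res_dt are illustrative, and the shuffled input order is an assumption made to show why the sort is there.

```r
library(data.table)

# Illustrative data with the dates deliberately out of order within each ID.
dt <- data.table(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-05", "2006-04-01", "2006-04-03", "2006-04-02",
                   "2006-04-04", "2006-03-16", "2006-03-06", "2006-03-15",
                   "2006-03-14"))
)

# Sort by ID and date (by reference) so diff() compares neighbouring days.
setorder(dt, ani_id_year, days)

# For each ID, return its rows only if no consecutive difference exceeds 7 days;
# groups for which the condition fails return NULL and are dropped.
res_dt <- dt[, if (!any(diff(days) > 7)) .SD, by = ani_id_year]
```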
Data
df <- structure(list(ani_id_year = c("ID468_2006", "ID468_2006", "ID468_2006",
"ID468_2006", "ID468_2006", "ID599_2006", "ID599_2006", "ID599_2006",
"ID599_2006"), days = structure(c(13239, 13240, 13241, 13242,
13243, 13213, 13221, 13222, 13223), class = "Date")), row.names = c(NA,
-9L), class = "data.frame")