Is it possible to filter out outliers that overlap based on a time variable in R?
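How can I remove outliers with filter() by checking for overlaps between the ID column and the date/time columns?
For example, the first two rows with ID = 1 in the table below overlap in time and need to be removed.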
ID | Time start | Time end |
---|---|---|
1 | 2015-03-16 10:40:00 | 2015-03-16 11:10:00 |
1 | 2015-03-16 10:50:00 | 2015-03-16 10:59:00 |
2 | 2015-03-16 10:40:00 | 2015-03-16 10:45:00 |
1 | 2015-03-16 11:20:00 | 2015-03-16 11:28:56 |
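To make "overlap" concrete, here is a minimal base R sketch of the usual interval test: two closed intervals overlap when each one starts no later than the other ends. The helper name `intervals_overlap` is hypothetical, and the `"UTC"` time zone is an assumption, since the table does not state one.

```r
# Sample rows from the table above; the "UTC" time zone is an assumption.
df <- data.frame(
  id    = c(1, 1, 2, 1),
  start = as.POSIXct(c("2015-03-16 10:40:00", "2015-03-16 10:50:00",
                       "2015-03-16 10:40:00", "2015-03-16 11:20:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-03-16 11:10:00", "2015-03-16 10:59:00",
                       "2015-03-16 10:45:00", "2015-03-16 11:28:56"), tz = "UTC")
)

# Hypothetical helper: TRUE when two closed intervals share at least one instant.
intervals_overlap <- function(s1, e1, s2, e2) {
  s1 <= e2 & s2 <= e1
}

# Rows 1 and 2 (both id 1): the second interval lies inside the first.
rows_1_2 <- intervals_overlap(df$start[1], df$end[1], df$start[2], df$end[2])
# Rows 1 and 4 (both id 1): row 4 starts after row 1 has ended.
rows_1_4 <- intervals_overlap(df$start[1], df$end[1], df$start[4], df$end[4])

print(rows_1_2)  # TRUE
print(rows_1_4)  # FALSE
```

Row 3 has the same start time as row 1 but belongs to ID 2, so the comparison only makes sense within each ID.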
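Try this to remove any time overlaps within a group. Please test it with more data to see whether it does what you need; I have only tried it on the small sample below.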
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#>     date, intersect, setdiff, union
library(slider)

tribble(
  ~id, ~start, ~end,
  1, "2015-03-16 10:40:00", "2015-03-16 11:10:00",
  1, "2015-03-16 10:50:00", "2015-03-16 10:59:00",
  1, "2015-03-16 11:09:00", "2015-03-16 11:11:00",
  2, "2015-03-16 10:40:00", "2015-03-16 10:45:00",
  1, "2015-03-16 11:20:00", "2015-03-16 11:28:56",
  1, "2015-03-16 11:27:00", "2015-03-16 11:30:56",
  2, "2015-03-16 10:44:00", "2015-03-16 11:45:00"
) |>
  # Parse the timestamps in the local time zone
  mutate(
    start = ymd_hms(start, tz = Sys.timezone()),
    end = ymd_hms(end, tz = Sys.timezone())
  ) |>
  arrange(id, start, end) |>
  group_by(id) |>
  mutate(
    # Running earliest start and latest end over the current and all earlier rows of the group
    roll_start = slide_vec(start, min, .before = Inf),
    roll_end = slide_vec(end, max, .before = Inf),
    # A row overlaps if its start or end falls inside the span covered by the earlier rows
    overlap = if_else((start >= lag(roll_start) & start <= lag(roll_end)) |
      (end >= lag(roll_start) & end <= lag(roll_end)), "yes", "no")
  ) |>
  # Keep non-overlapping rows; the first row of each group has overlap = NA
  filter(overlap == "no" | is.na(overlap)) |>
  select(-c(starts_with("roll_"), overlap))
#> # A tibble: 3 × 3
#> # Groups:   id [2]
#>      id start               end
#>   <dbl> <dttm>              <dttm>
#> 1     1 2015-03-16 10:40:00 2015-03-16 11:10:00
#> 2     1 2015-03-16 11:20:00 2015-03-16 11:28:56
#> 3     2 2015-03-16 10:40:00 2015-03-16 10:45:00
Created on 2022-04-30 by the reprex package (v2.0.1)
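The code above keeps the earliest interval of each overlapping run and drops the later ones. If every interval involved in an overlap should go, as the question's wording may suggest, the same running-maximum idea can be reused to label clusters of overlapping rows. The following is only a sketch under some assumptions: it uses dplyr alone (no slider), the column names `prev_max_end`, `new_cluster`, `cluster` and `n_in_cluster` are made up for illustration, and the `"UTC"` time zone is assumed. Base R's `cummax()` is not defined for date-times, so the end times are converted to numbers first.

```r
library(dplyr)

# The four rows from the question; the "UTC" time zone is an assumption.
events <- tibble(
  id    = c(1, 1, 2, 1),
  start = as.POSIXct(c("2015-03-16 10:40:00", "2015-03-16 10:50:00",
                       "2015-03-16 10:40:00", "2015-03-16 11:20:00"), tz = "UTC"),
  end   = as.POSIXct(c("2015-03-16 11:10:00", "2015-03-16 10:59:00",
                       "2015-03-16 10:45:00", "2015-03-16 11:28:56"), tz = "UTC")
)

flagged <- events |>
  arrange(id, start, end) |>
  group_by(id) |>
  mutate(
    # Latest end time among all earlier rows of the same id (NA for the first row)
    prev_max_end = lag(cummax(as.numeric(end))),
    # A new cluster begins when this row starts after everything before it has ended
    new_cluster = is.na(prev_max_end) | as.numeric(start) > prev_max_end,
    cluster = cumsum(new_cluster)
  ) |>
  group_by(id, cluster) |>
  mutate(n_in_cluster = n()) |>
  ungroup()

# Variant 1: keep only the earliest interval of each cluster (3 rows here)
keep_first <- flagged |>
  filter(new_cluster) |>
  select(id, start, end)

# Variant 2: drop every interval that overlaps another one (2 rows here)
drop_all <- flagged |>
  filter(n_in_cluster == 1) |>
  select(id, start, end)
```

On this sample, `keep_first` retains the 10:40 and 11:20 intervals for ID 1 plus the single ID 2 interval, while `drop_all` retains only the 11:20 interval for ID 1 and the ID 2 interval. As with the answer above, it is worth testing on more data before relying on either variant.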