Remove NAs if the gap is greater than a certain time interval or a certain number of rows
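I have a data frame of locations in which some positions have NA values at certain datetimes. I want to estimate the locations for these NA values, but when there are more than 3 consecutive NA values (a gap of more than 3 hours), I want to remove them from the dataset (i.e. I do not want to estimate locations for gaps longer than 3 rows/3 hours of NAs).
Here is an example of my data: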
table <- "id date time lat lon
1 A 2011-10-03 05:00:00 35.0 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36.0 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 00:00:00 NA NA
16 B 2012-11-09 01:00:00 NA NA
17 B 2012-11-09 02:00:00 NA NA
18 B 2012-11-09 03:00:00 NA NA
19 B 2012-11-09 04:00:00 NA NA
20 B 2012-11-09 05:00:00 36.6 -53.5"
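library(tidyverse)
# Create a data frame from the table above
df <- read.table(text = table, header = TRUE)
# Combine date and time into a single datetime column and keep the result,
# since the answers below work on df$datetime
df <- df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))
df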
Here is an example of the desired output (note that rows 15-19 have been removed, because they form a gap of 5 NA values/5 hours):
table <- "id datetime lat lon
1 A 2011-10-03 05:00:00 35.0 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36.0 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 05:00:00 36.6 -53.5"
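Short of selecting specific rows individually (which I cannot do, because the dataset is large), I cannot figure out how to tell R to keep NAs only when they occur in runs of 3 or fewer (3 hours or less). Any help would be appreciated!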
df %>%
  group_by(grp1 = cumsum(!is.na(lat) & !is.na(lon)), grp2 = !is.na(lat) & !is.na(lon)) %>%
  # keep every observed row, and NA rows only when their run has 3 or fewer rows
  filter((!is.na(lat) & !is.na(lon)) | n() <= 3) %>%
  ungroup()
# # A tibble: 15 x 6
# id datetime lat lon grp1 grp2
# <chr> <dttm> <dbl> <dbl> <int> <lgl>
# 1 A 2011-10-03 05:00:00 35 -53.4 1 TRUE
# 2 A 2011-10-03 06:00:00 35.1 -53.4 2 TRUE
# 3 A 2011-10-03 07:00:00 NA NA 2 FALSE
# 4 A 2011-10-03 08:00:00 NA NA 2 FALSE
# 5 A 2011-10-03 09:00:00 35.1 -53.4 3 TRUE
# 6 A 2011-10-03 10:00:00 36.2 -53.6 4 TRUE
# 7 A 2011-10-03 23:00:00 36.6 -53.6 5 TRUE
# 8 B 2012-11-08 05:00:00 35.8 -53.4 6 TRUE
# 9 B 2012-11-08 06:00:00 NA NA 6 FALSE
# 10 B 2012-11-08 07:00:00 36 -53.4 7 TRUE
# 11 B 2012-11-08 08:00:00 NA NA 7 FALSE
# 12 B 2012-11-08 09:00:00 NA NA 7 FALSE
# 13 B 2012-11-08 10:00:00 36.5 -53.4 8 TRUE
# 14 B 2012-11-08 23:00:00 36.6 -53.4 9 TRUE
# 15 B 2012-11-09 05:00:00 36.6 -53.5 10 TRUE
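This creates two (temporary) grouping columns: grp1 increments every time we reach a non-NA row (of lat/lon), and grp2 further subsets it so that each group contains only NA rows or only non-NA rows.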
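To make the grouping logic easier to follow, here is a minimal base R sketch on a toy vector. The values and the names `ok`, `run_len` and `keep` are made up for illustration and are not part of the answer above; `ave()` stands in for `n()` after `group_by()`.

```r
# Toy latitude vector (hypothetical values): a run of 2 NAs and a run of 4 NAs
lat <- c(35.0, NA, NA, 35.1, NA, NA, NA, NA, 36.6)

ok   <- !is.na(lat)   # plays the role of grp2: TRUE for observed rows
grp1 <- cumsum(ok)    # increments at every observed row

# Size of the (grp1, grp2) group each row belongs to, like n() after group_by()
run_len <- ave(seq_along(lat), grp1, ok, FUN = length)

# Keep observed rows, and NA rows only when their run has 3 or fewer rows
keep <- ok | run_len <= 3

data.frame(lat, grp1, grp2 = ok, run_len, keep)
lat[keep]
```

The run of 2 NAs survives and the run of 4 NAs is dropped, which is the same rule the `filter()` call applies to each group.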
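An alternative that creates only one new grouping column:
df %>%
  # carry the datetime of the last observed fix forward over the NA rows
  mutate(tmpdttm = if_else(!is.na(lat) & !is.na(lon), datetime, datetime[NA])) %>%
  tidyr::fill(tmpdttm) %>%
  group_by(tmpdttm) %>%
  # each group holds one observed row plus the NA rows after it, so count only the NAs
  filter(!is.na(lat) | sum(is.na(lat)) <= 3) %>%
  ungroup()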
# # A tibble: 15 x 5
# id datetime lat lon tmpdttm
# <chr> <dttm> <dbl> <dbl> <dttm>
# 1 A 2011-10-03 05:00:00 35 -53.4 2011-10-03 05:00:00
# 2 A 2011-10-03 06:00:00 35.1 -53.4 2011-10-03 06:00:00
# 3 A 2011-10-03 07:00:00 NA NA 2011-10-03 06:00:00
# 4 A 2011-10-03 08:00:00 NA NA 2011-10-03 06:00:00
# 5 A 2011-10-03 09:00:00 35.1 -53.4 2011-10-03 09:00:00
# 6 A 2011-10-03 10:00:00 36.2 -53.6 2011-10-03 10:00:00
# 7 A 2011-10-03 23:00:00 36.6 -53.6 2011-10-03 23:00:00
# 8 B 2012-11-08 05:00:00 35.8 -53.4 2012-11-08 05:00:00
# 9 B 2012-11-08 06:00:00 NA NA 2012-11-08 05:00:00
# 10 B 2012-11-08 07:00:00 36 -53.4 2012-11-08 07:00:00
# 11 B 2012-11-08 08:00:00 NA NA 2012-11-08 07:00:00
# 12 B 2012-11-08 09:00:00 NA NA 2012-11-08 07:00:00
# 13 B 2012-11-08 10:00:00 36.5 -53.4 2012-11-08 10:00:00
# 14 B 2012-11-08 23:00:00 36.6 -53.4 2012-11-08 23:00:00
# 15 B 2012-11-09 05:00:00 36.6 -53.5 2012-11-09 05:00:00
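Both approaches above count rows, which matches hours here only because the sample data is recorded hourly. If the fixes were irregularly spaced, one possible variation is to measure each gap in time instead. The sketch below is an assumption-laden illustration, not part of either answer: the `track` data, the names `prev_fix`, `next_fix`, `gap_hours` and `max_gap_hours`, and the decision to drop NA runs at the very start or end of a track are all hypothetical choices.

```r
library(dplyr)
library(tidyr)

# Hypothetical irregularly sampled track (not the data from the question)
track <- tibble(
  id = "A",
  datetime = as.POSIXct(c("2011-10-03 05:00:00", "2011-10-03 06:00:00",
                          "2011-10-03 07:30:00", "2011-10-03 08:00:00",
                          "2011-10-03 12:00:00", "2011-10-03 13:00:00",
                          "2011-10-03 14:00:00"), tz = "UTC"),
  lat = c(35.0, NA, NA, 35.1, NA, 35.3, 35.4)
)

max_gap_hours <- 3  # assumed threshold

result <- track %>%
  group_by(id) %>%
  mutate(
    # datetime of observed fixes only; NA on rows with a missing position
    prev_fix = replace(datetime, is.na(lat), NA),
    next_fix = prev_fix
  ) %>%
  fill(prev_fix, .direction = "down") %>%  # last observed fix before each row
  fill(next_fix, .direction = "up") %>%    # first observed fix after each row
  mutate(gap_hours = as.numeric(difftime(next_fix, prev_fix, units = "hours"))) %>%
  # keep observed rows, and NA rows only when the bracketing fixes are close enough;
  # NA runs with no fix on one side have gap_hours = NA and are dropped
  filter(!is.na(lat) | (!is.na(gap_hours) & gap_hours <= max_gap_hours)) %>%
  ungroup() %>%
  select(-prev_fix, -next_fix)

result
```

In this toy track the two NA rows between 05:00 and 08:00 are kept (a 3-hour gap), while the single NA row at 12:00 is dropped because the fixes around it are 5 hours apart.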
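I used the tidyverse and split my solution into two steps:
# Step 1: number the runs; gn stays constant only across consecutive NA rows
df1 <- df %>%
  group_by(id) %>%
  mutate(gn = cumsum(!(is.na(lat) & is.na(lag(lat, default = 0))))) %>%
  ungroup()
# Step 2: count the rows in each run and keep only runs of 3 or fewer rows
df1 %>%
  group_by(id, gn) %>%
  summarise(count = n()) %>% ungroup() %>%
  filter(count <= 3) %>%
  inner_join(df1, by = c('id','gn'))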
Here is a tidyverse solution that uses rleid() from data.table. It assumes df already has the combined datetime column, as in the answers above:
library(data.table)
library(tidyverse)
df %>%
  # rleid() gives each consecutive run of NA / non-NA rows its own number
  group_by(new = rleid(is.na(lat))) %>%
  # keep every observed row, and NA rows only when their run has 3 or fewer rows
  filter(!is.na(lat) | n() <= 3) %>%
  ungroup() %>%
  select(-new)
This gives us:
# A tibble: 15 x 4
id datetime lat lon
<chr> <dttm> <dbl> <dbl>
1 A 2011-10-03 05:00:00 35 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 05:00:00 36.6 -53.5