Create column to sum consecutive TRUE or FALSE values, then remove all sequences with consecutive NAs of a certain sum
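I have a data frame (df) that contains id, date, time, and location (lat and lon) columns. My goal is to create a column holding the length of each run of consecutive NAs, so that I can remove series of consecutive NAs longer than a given number.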
Here is a sample of my data:
table <- "id date time lat lon
1 A 2011-10-03 05:00:00 35.0 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36.0 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 00:00:00 NA NA
16 B 2012-11-09 01:00:00 NA NA
17 B 2012-11-09 02:00:00 NA NA
18 B 2012-11-09 03:00:00 NA NA
19 B 2012-11-09 04:00:00 NA NA
20 B 2012-11-09 05:00:00 36.6 -53.5"
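# Load the packages that provide %>%, mutate() and unite()
library(dplyr)
library(tidyr)
# Create a data frame from the table above
df <- read.table(text = table, header = TRUE)
df
# Combine date and time into a single date-time column
df <- df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))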
I created a new TRUE/FALSE column that flags the NA values:
df$gap <- is.na(df$lat)
head(df)
# A tibble: 6 x 5
  id    datetime              lat   lon gap
  <chr> <dttm>              <dbl> <dbl> <lgl>
1 A     2011-10-03 05:00:00  35   -53.4 FALSE
2 A     2011-10-03 06:00:00  35.1 -53.4 FALSE
3 A     2011-10-03 07:00:00  NA    NA   TRUE
4 A     2011-10-03 08:00:00  NA    NA   TRUE
5 A     2011-10-03 09:00:00  35.1 -53.4 FALSE
6 A     2011-10-03 10:00:00  36.2 -53.6 FALSE
I then tried various approaches to sum the consecutive TRUE or FALSE values, but this is the best I could come up with:
df <- df %>%
  group_by(id, grp = with(rle(gap), rep(seq_along(lengths), lengths))) %>%
  mutate(length = seq_along(grp)) %>%
  ungroup() %>%
  select(-grp)
head(df)
# A tibble: 6 x 6
  id    datetime              lat   lon gap   length
  <chr> <dttm>              <dbl> <dbl> <lgl>  <int>
1 A     2011-10-03 05:00:00  35   -53.4 FALSE      1
2 A     2011-10-03 06:00:00  35.1 -53.4 FALSE      2
3 A     2011-10-03 07:00:00  NA    NA   TRUE       1
4 A     2011-10-03 08:00:00  NA    NA   TRUE       2
5 A     2011-10-03 09:00:00  35.1 -53.4 FALSE      1
6 A     2011-10-03 10:00:00  36.2 -53.6 FALSE      2
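The problem is that this adds a running count (1, 2, 3, 4, 5, and so on), whereas I want every row in a run of positions or NAs to hold the total number of consecutive TRUE or FALSE values in that run (i.e. 5, 5, 5, 5, 5).
The desired output is: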
table <- "id datetime lat lon gap length
1 A 2011-10-03 05:00:00 35 -53.4 FALSE 2
2 A 2011-10-03 06:00:00 35.1 -53.4 FALSE 2
3 A 2011-10-03 07:00:00 NA NA TRUE 2
4 A 2011-10-03 08:00:00 NA NA TRUE 2
5 A 2011-10-03 09:00:00 35.1 -53.4 FALSE 3
6 A 2011-10-03 10:00:00 36.2 -53.6 FALSE 3
7 A 2011-10-03 23:00:00 36.6 -53.6 FALSE 3
8 B 2012-11-08 05:00:00 35.8 -53.4 FALSE 1
9 B 2012-11-08 06:00:00 NA NA TRUE 1
10 B 2012-11-08 07:00:00 36 -53.4 FALSE 1
11 B 2012-11-08 08:00:00 NA NA TRUE 2
12 B 2012-11-08 09:00:00 NA NA TRUE 2
13 B 2012-11-08 10:00:00 36.5 -53.4 FALSE 2
14 B 2012-11-08 23:00:00 36.6 -53.4 FALSE 2
15 B 2012-11-09 00:00:00 NA NA TRUE 5
16 B 2012-11-09 01:00:00 NA NA TRUE 5
17 B 2012-11-09 02:00:00 NA NA TRUE 5
18 B 2012-11-09 03:00:00 NA NA TRUE 5
19 B 2012-11-09 04:00:00 NA NA TRUE 5
20 B 2012-11-09 05:00:00 36.6 -53.5 FALSE 1"
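From here, I need to remove from the dataset any ID that has a run of 5 or more NAs. The catch is that I do not want to remove IDs that have a run of 5 or more non-NA values (i.e. IDs with 5 or more consecutive lat/lon positions need to stay).
In this example, the desired output would be individual A only, because B has a run of 5 consecutive NAs: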
table <- "id datetime lat lon gap length
1 A 2011-10-03 05:00:00 35 -53.4 FALSE 2
2 A 2011-10-03 06:00:00 35.1 -53.4 FALSE 2
3 A 2011-10-03 07:00:00 NA NA TRUE 2
4 A 2011-10-03 08:00:00 NA NA TRUE 2
5 A 2011-10-03 09:00:00 35.1 -53.4 FALSE 3
6 A 2011-10-03 10:00:00 36.2 -53.6 FALSE 3
7 A 2011-10-03 23:00:00 36.6 -53.6 FALSE 3"
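But I need to make sure that the code removing gaps of length 5 or greater does not also remove IDs that have runs of 5 or more lat/lon positions. I don't know where to start with this part of the problem.
Any help would be greatly appreciated.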
tidyverse
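df %>%
  group_by(id) %>%
  # number each run of consecutive NA / non-NA values within an id
  mutate(grp = data.table::rleid(is.na(lat))) %>%
  group_by(grp, .add = TRUE) %>%
  # count the NAs in each run: the run length for NA runs, 0 otherwise
  mutate(res = sum(is.na(lat))) %>%
  group_by(id) %>%
  # drop every id that contains a run of 5 or more NAs
  filter(!any(res >= 5)) %>%
  select(-c(grp, res)) %>%
  ungroup()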
# A tibble: 7 x 4
  id    datetime              lat   lon
  <chr> <dttm>              <dbl> <dbl>
1 A     2011-10-03 05:00:00  35   -53.4
2 A     2011-10-03 06:00:00  35.1 -53.4
3 A     2011-10-03 07:00:00  NA    NA
4 A     2011-10-03 08:00:00  NA    NA
5 A     2011-10-03 09:00:00  35.1 -53.4
6 A     2011-10-03 10:00:00  36.2 -53.6
7 A     2011-10-03 23:00:00  36.6 -53.6
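If the per-run length column from the question is wanted as well, a minimal sketch along the same lines could use base rle() instead of data.table::rleid(). The run_length() helper, the max_gap threshold, and the cut-down df below are names and data introduced here for illustration only; they are not part of the original question or answer.

```r
library(dplyr)

# Hypothetical helper: for each element, the total length of the run it belongs to
run_length <- function(x) {
  r <- rle(x)
  rep(r$lengths, r$lengths)
}

# Cut-down stand-in for the question's data (assumed, not the full table)
df <- data.frame(
  id  = c(rep("A", 7), rep("B", 7)),
  lat = c(35.0, 35.1, NA, NA, 35.1, 36.2, 36.6,
          36.6, NA, NA, NA, NA, NA, 36.6)
)

max_gap <- 5  # assumed threshold: drop ids with a run of this many NAs or more

res <- df %>%
  group_by(id) %>%
  mutate(gap = is.na(lat),
         length = run_length(gap)) %>%
  # only NA runs count towards removal; long runs of valid positions are ignored
  filter(!any(gap & length >= max_gap)) %>%
  ungroup()

res
```

Because the filter tests gap and length together, an ID with many consecutive valid positions is kept, while the length column still reports the run total for both TRUE and FALSE runs.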
data.table
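library(data.table)
library(magrittr)  # provides %>%
# number each run of consecutive NA / non-NA values within an id
setDT(df)[, grp := rleid(is.na(lat)), by = id] %>%
  # replace the run id with the number of NAs in that run (0 for non-NA runs)
  .[, grp := sum(is.na(lat)), by = .(grp, id)] %>%
  # keep only ids with no run of 5 or more NAs
  .[, .SD[!any(grp >= 5)], by = id] %>%
  .[]
   id            datetime  lat   lon grp
1:  A 2011-10-03 05:00:00 35.0 -53.4   0
2:  A 2011-10-03 06:00:00 35.1 -53.4   0
3:  A 2011-10-03 07:00:00   NA    NA   2
4:  A 2011-10-03 08:00:00   NA    NA   2
5:  A 2011-10-03 09:00:00 35.1 -53.4   0
6:  A 2011-10-03 10:00:00 36.2 -53.6   0
7:  A 2011-10-03 23:00:00 36.6 -53.6   0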
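One way to check that a filter like this reacts only to NA gaps, and not to long runs of valid positions, is to try it on made-up IDs that exercise both cases. The sketch below uses base R only; has_long_na_run() and the IDs "C" and "D" are hypothetical and introduced purely for this check.

```r
# Hypothetical helper: TRUE if x contains a run of at least `n` consecutive NAs
has_long_na_run <- function(x, n = 5) {
  r <- rle(is.na(x))
  any(r$values & r$lengths >= n)
}

# Made-up test data: "C" has six valid positions in a row,
# "D" has five NAs in a row
chk <- data.frame(
  id  = rep(c("C", "D"), each = 7),
  lat = c(35.1, 35.2, 35.3, 35.4, 35.5, 35.6, NA,
          36.0, NA, NA, NA, NA, NA, 36.1)
)

# Flag each id, then keep only the rows of ids without a long NA gap
drop_id <- tapply(chk$lat, chk$id, has_long_na_run)
kept <- chk[!drop_id[chk$id], ]
unique(kept$id)
```

Here "C" should survive despite its six consecutive positions, and "D" should be dropped for its five consecutive NAs. A filter that counts every run regardless of its value would drop both.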