R: how to filter a time series of measurements based on previous values

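I'm trying to filter coral demographic data in a time series. I have a set of corals that are measured every 3 months. What I want to do is a.) keep all corals that at some point had a max diameter within a specified size range (8-12 mm diameter), b.) remove corals that were larger than the size range beforehand, and c.) remove the measurement time steps where a coral fell back into the size range after already reaching or passing it, by including only the first measurement at which each coral grew into the size range (8-12 mm) and the next subsequent measurement.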

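I have created a sample database and a desired database to show exactly what I'm looking for. In the sample database, the Notes column next to each coral's first entry also contains the criteria listed below, for reference. Here are the 9 corals I included in the database, with a description in words of what I would like to do with each: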

Coral #1 should be deleted from the database entirely because it skipped over the desired size range of 8-12 mm.

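Coral #2 should be deleted from the database because it started above the desired size range, shrank below it, and then grew back. I only want corals that grew into the size range without shrinking beforehand.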

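Coral #3 is an example of a coral that grew into the size range (8-12 mm) without shrinking, and it is a coral I want to keep because it grew into the size range. However, I only want to include the first measurement inside the size range (9 mm at TimeStep 3 in this case) and the following measurement (12 mm at TimeStep 4 in this case).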

Coral #4 is an example of a coral that started off above the size range and stayed above it, so it needs to be removed.

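Coral #5 is an example of a coral that started below the range, grew into it, grew past it, and then shrank back into the range (TimeStep 4). For this scenario, I only want to include the first time the diameter fell into the range (TimeStep 2) and the following measurement (TimeStep 3), not the second time it fell into the range. This is because the first time reflects natural growth, whereas the second time reflects shrinkage and the resulting recovery (which I want to exclude or filter out).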

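Coral #6 is an example of a coral that started in the size range at TimeStep 1, grew out of it in the next TimeStep, and continued to grow afterwards. I only want to keep the measurements at TimeStep 1 and 2 (the first measurement inside the range and the following measurement).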

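Coral #7 is an example of a coral that started in the size range at TimeStep 1 and remained in the range at TimeStep 2. In this case I only want the first measurement in the size range (TimeStep 1) and the subsequent measurement (TimeStep 2).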

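Coral #8 is an example of a coral that grew into the size range at TimeStep 3, stayed in the range (10 => 9) at TimeStep 4, then shrank below the desired range, and grew back into the range at TimeStep 6. For this colony, I again want the first measurement inside the range (10 mm at TimeStep 3) and the following measurement at TimeStep 4 to be included.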

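Coral #9 is an example of a coral that grew into the size range at TimeStep 3 (9 mm) but was not found at the following TimeStep (NF in the Status Code column, with a measurement of NA). I want to keep corals like this in the dataset to calculate survival.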

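In summary, I want code that filters this database so that corals are removed entirely if they had a diameter in the 8-12 mm size range at some point but were larger than the range beforehand, if they were never at or below the range, or if they started below the range but never fell within it. In addition, I want to keep any coral that grew into the range and later shrank, while dropping the second time it fell into the range. This would be done by removing all measurements except the first TimeStep at which the coral grew into the size range and the measurement at the following TimeStep.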

Sample database

data <- structure(list(Site = c("WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI"), `Module #` = c(116, 116, 116, 116, 116, 116, 116, 115, 
115, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 
116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 
116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 
116, 116, 116, 116, 116), Side = c("N", "N", "N", "N", "N", "N", 
"N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
"N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
"N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", 
"N", "N", "N", "N", "N", "N", "N"), TimeStep = c(1, 2, 3, 4, 
5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 
2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 
5, 6, 1, 2, 3, 4), Settlement_Area = c(0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336), `Colony #` = c(1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 
5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 
9, 9, 9), Location = c("C1", "C1", "C1", "C1", "C1", "C1", "B4", 
"B4", "B4", "B4", "B4", "B4", "A1", "A1", "A1", "A1", "A1", "A1", 
"B3", "B3", "B3", "B3", "B3", "B3", "D1", "D1", "D1", "D1", "D1", 
"D1", "A2", "A2", "A2", "A2", "A2", "A2", "A4", "A4", "A4", "A4", 
"A4", "A4", "B3", "B3", "B3", "B3", "B3", "B3", "A3", "A3", "A3", 
"A3"), `Taxonomic Code` = c("PC", "PC", "PC", "PC", "PC", "PC", 
"PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", 
"PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", 
"PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", 
"PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", "PC", 
"PC", "PC"), `Cover Code` = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA), 
    `Max Diameter (cm)` = c(5, 7, 13, 15, 16, 19, 15, 7, 9, 11, 
    14, 18, 3, 6, 9, 12, 15, 20, 13, 16, 18, 21, 23, 26, 6, 9, 
    14, 12, 15, 18, 11, 14, 17, 17, 21, 24, 9, 11, 14, 16, 20, 
    22, 3, 6, 10, 9, 7, 10, 4, 6, 9, NA), `Status Code` = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, "NF"), Notes = c("coral # 1 should be deleted from the database because it skipped over the desired size range of 8-12 mm", 
    NA, NA, NA, NA, NA, "coral # 2 should be deleted from the database because it started above the desired size range then shrank back into it.  I only want corals that have grown into the size range", 
    NA, NA, NA, NA, NA, "Colony # 3 is an example of a coral that grew to the size range (8-12 mm) and beyond without shrinking and this is a coral that I want to keep because it grew to the size range.  However, I want to only include the FIRST measure inside the size range (9 mm in this case) and the proceeding measurement (12 mm)", 
    NA, NA, NA, NA, NA, "Colony # 4 is an example of a coral that started off above the size range and therefore needs to be removed.", 
    NA, NA, NA, NA, NA, "Colony # 5 is an example of a coral that started below the range, grew into it, then later shrank back into the range (TimeStep 4). For this scenario, I want to only include the first time the diameter fell into the range (TimeStep 2) and the proceeding measurement, not the second time it fell into the range. This is because the first time is natural growth whereas the second time is shrinkage and its resulting recovery (which I want to exclude or filter out).", 
    NA, NA, NA, NA, NA, "Colony # 6 is an example of a coral that started in the size range for TimeStep 1 and then grew out of it in the next TimeStep and continued to grow after. I want to maintain only the measurements in TimeStep 1 and 2 (the first measure inside the range and the proceeding measurement)", 
    NA, NA, NA, NA, NA, "Colony # 7 is an example of a coral that started in the size range in TimeStep 1 and then remained in the range for TimeStep 2. In this case I only want the first measurement in the size range (TimeStep 1) and the subsequent measurement (TimeStep 2)", 
    NA, NA, NA, NA, NA, "Colony # 8 is an example of a coral that grew to the size range in TimeStep 3, stayed in the range (10 => 9) in TimeStep 4, then shrank below the desired range then for TimeStep 6 grew back to the range. For this colony, again I want the FIRST measurement inside the range (10 mm at TimeStep 3) and the proceeding measurement in TimeStep 4 included for this coral", 
    NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -52L), spec = structure(list(
    cols = list(Site = structure(list(), class = c("collector_character", 
    "collector")), `Module #` = structure(list(), class = c("collector_double", 
    "collector")), Side = structure(list(), class = c("collector_character", 
    "collector")), TimeStep = structure(list(), class = c("collector_double", 
    "collector")), Settlement_Area = structure(list(), class = c("collector_double", 
    "collector")), `Colony #` = structure(list(), class = c("collector_double", 
    "collector")), Location = structure(list(), class = c("collector_character", 
    "collector")), `Taxonomic Code` = structure(list(), class = c("collector_character", 
    "collector")), `Cover Code` = structure(list(), class = c("collector_double", 
    "collector")), `Max Diameter (cm)` = structure(list(), class = c("collector_double", 
    "collector")), `Status Code` = structure(list(), class = c("collector_character", 
    "collector")), Notes = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

Desired database

data_final <- structure(list(Site = c("WAI", "WAI", "WAI", "WAI", "WAI", "WAI", 
"WAI", "WAI", "WAI", "WAI", "WAI", "WAI"), `Module #` = c(116, 
116, 116, 116, 116, 116, 116, 116, 116, 116, 116, 116), Side = c("N", 
"N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"), TimeStep = c(3, 
4, 2, 3, 1, 2, 1, 2, 3, 4, 3, 4), Settlement_Area = c(0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336, 
0.75902336, 0.75902336, 0.75902336, 0.75902336, 0.75902336), 
    `Colony #` = c(3, 3, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9), Location = c("A1", 
    "A1", "D1", "D1", "A2", "A2", "A4", "A4", "B3", "B3", "B2", 
    "B2"), `Taxonomic Code` = c("PC", "PC", "PC", "PC", "PC", 
    "PC", "PC", "PC", "PC", "PC", "PC", "PC"), `Cover Code` = c(1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA), `Max Diameter (cm)` = c(9, 
    12, 9, 14, 11, 14, 9, 11, 10, 9, 9, NA), `Status Code` = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "NF")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L), spec = structure(list(
    cols = list(Site = structure(list(), class = c("collector_character", 
    "collector")), `Module #` = structure(list(), class = c("collector_double", 
    "collector")), Side = structure(list(), class = c("collector_character", 
    "collector")), TimeStep = structure(list(), class = c("collector_double", 
    "collector")), Settlement_Area = structure(list(), class = c("collector_double", 
    "collector")), `Colony #` = structure(list(), class = c("collector_double", 
    "collector")), Location = structure(list(), class = c("collector_character", 
    "collector")), `Taxonomic Code` = structure(list(), class = c("collector_character", 
    "collector")), `Cover Code` = structure(list(), class = c("collector_double", 
    "collector")), `Max Diameter (cm)` = structure(list(), class = c("collector_double", 
    "collector")), `Status Code` = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

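So far, I have been able to drop the corals that were never in the size range by creating a vector of the unique colony numbers that did fall between 8 and 12 mm: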

library(dplyr)

size_vect <- seq(from = 8, to = 12, by = 1)
# a vector containing the range of diameter measurements we want to filter for

ID_vect <- data %>% group_by(`Colony #`) %>% 
  filter(`Max Diameter (cm)` >= min(size_vect) & `Max Diameter (cm)` <= max(size_vect)) %>% 
  # select all measures where the coral fell within the size range (8 and 12 included)
  distinct(`Colony #`) %>% 
  # remove duplicate colony numbers
  pull(`Colony #`)
  # extract the `Colony #` column as a vector

I then filtered the full sample database to include only the coral colonies in ID_vect:

data_new <- data %>% group_by(`Colony #`) %>%
filter(`Colony #` %in% ID_vect) 
# filter for all corals that contain the same colony number as those in the ID_vect

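What I don't know is how to filter the database on the following condition: if a coral fell into the size range at some point, but a previous measurement was greater than the maximum of the desired size range (12 mm), that coral should be removed entirely. For example, Coral #2 should be removed because it measured 15 mm at TimeStep 1, above the range, before its value fell into the range at TimeStep 3.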

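I also don't know how to handle the case where the next TimeStep has no measurement. For example, Coral #9 measured 9 mm at TimeStep 3 and was not found (NF in Status Code) at TimeStep 4. I need to keep the TimeStep 4 row to calculate survival. I don't know how to write the code for this conditional filter, and that is where I need help. Any code suggestions are appreciated!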

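We can use run-length encoding (RLE) to help us keep track of the transitions from in range to out of range. This is easier with data.table::rleid, which I would recommend.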

Here is an example of what the RLE does for coral 8.

 `Colony #` `Max Diameter (cm)` InRange RLE
          8                   3   FALSE   1
          8                   6   FALSE   1
          8                  10    TRUE   2
          8                   9    TRUE   2
          8                   7   FALSE   3
          8                  10    TRUE   4
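
As a minimal sketch of where those RLE values come from, the run IDs for coral 8 can be reproduced in base R with rle(). On this input the result matches the RLE column shown above. The names diam, in_range, runs, and run_id are illustrative only.

```r
# Diameters for colony 8, copied from the sample data
diam <- c(3, 6, 10, 9, 7, 10)

# TRUE where the measurement lies in the 8-12 range (bounds included)
in_range <- diam >= 8 & diam <= 12

# Number each run of identical consecutive values: 1, 2, 3, ...
runs <- rle(in_range)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

print(data.frame(diam, in_range, run_id))
```

Each change between FALSE and TRUE starts a new run, so a coral that leaves the range and later re-enters it gets a higher run ID the second time.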

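After computing the RLE, we filter for rows where the minimum in-range RLE is lower than the minimum above-range RLE, meaning the coral was inside the range before it was ever above it. If any such rows exist, we find the first time point in range and keep it along with the next time point.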

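library(dplyr)
# data.table only needs to be installed; rleid() is called with its namespace prefix

data %>% 
  select(-Notes) %>%
  # flag measurements inside (8-12) and above (> 12) the size range; NA counts as neither
  mutate(InRange = case_when(`Max Diameter (cm)` >= 8 & `Max Diameter (cm)` <= 12 ~ TRUE,
                             TRUE ~ FALSE)) %>% 
  mutate(AboveRange = case_when(`Max Diameter (cm)` > 12 ~ TRUE,
                                TRUE ~ FALSE)) %>% 
  group_by(`Colony #`) %>%
  # number each run of consecutive in-range / out-of-range measurements
  mutate(RLE = data.table::rleid(InRange)) %>% 
  # min() of an empty selection returns Inf (with a warning), e.g. for colonies never in range
  mutate(MinIn = min(RLE[InRange]), MinAbove = min(RLE[AboveRange]), MinInTime = min(TimeStep[InRange])) %>%
  # keep colonies that were in range before ever being above it, at the first in-range TimeStep and the next one
  filter(MinIn < MinAbove & (TimeStep == MinInTime | (TimeStep == MinInTime + 1))) %>% 
  select(-InRange,-AboveRange,-RLE,-MinIn,-MinAbove,-MinInTime)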
## A tibble: 12 x 11
## Groups:   Colony # [6]
#   Site  `Module #` Side  TimeStep Settlement_Area `Colony #` Location `Taxonomic Code` `Cover Code` `Max Diameter (cm)` `Status Code`
#   <chr>      <dbl> <chr>    <dbl>           <dbl>      <dbl> <chr>    <chr>                   <dbl>               <dbl> <chr>        
# 1 WAI          116 N            3           0.759          3 A1       PC                          1                   9 NA           
# 2 WAI          116 N            4           0.759          3 A1       PC                          1                  12 NA           
# 3 WAI          116 N            2           0.759          5 D1       PC                          1                   9 NA           
# 4 WAI          116 N            3           0.759          5 D1       PC                          1                  14 NA           
# 5 WAI          116 N            1           0.759          6 A2       PC                          1                  11 NA           
# 6 WAI          116 N            2           0.759          6 A2       PC                          1                  14 NA           
# 7 WAI          116 N            1           0.759          7 A4       PC                          1                   9 NA           
# 8 WAI          116 N            2           0.759          7 A4       PC                          1                  11 NA           
# 9 WAI          116 N            3           0.759          8 B3       PC                          1                  10 NA           
#10 WAI          116 N            4           0.759          8 B3       PC                          1                   9 NA           
#11 WAI          116 N            3           0.759          9 A3       PC                          1                   9 NA           
#12 WAI          116 N            4           0.759          9 A3       PC                         NA                  NA NF
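
The same rule can also be sketched in base R as a per-colony helper, which avoids the warnings from min() on empty selections. This is an illustration, not part of the answer above: keep_first_growth and the toy data frame are hypothetical names, and the sketch assumes each colony's rows are sorted by TimeStep with no missing time steps, so "the next row" stands in for "TimeStep + 1".

```r
# Hypothetical helper: given ONE colony's diameters in time order,
# return TRUE for the rows to keep.
keep_first_growth <- function(diam, lower = 8, upper = 12) {
  in_range <- !is.na(diam) & diam >= lower & diam <= upper
  above    <- !is.na(diam) & diam > upper
  keep     <- rep(FALSE, length(diam))

  # Never in range: drop the whole colony
  if (!any(in_range)) return(keep)

  first_in <- which(in_range)[1]
  # Above the range before first entering it: drop the whole colony
  if (any(above[seq_len(first_in - 1)])) return(keep)

  # Keep the first in-range row and the row after it, if there is one
  keep[first_in] <- TRUE
  if (first_in < length(diam)) keep[first_in + 1] <- TRUE
  keep
}

# Toy data: diameters of colonies 2, 5 and 9 from the sample database
toy <- data.frame(
  colony = rep(c(2, 5, 9), times = c(6, 6, 4)),
  step   = c(1:6, 1:6, 1:4),
  diam   = c(15, 7, 9, 11, 14, 18,  6, 9, 14, 12, 15, 18,  4, 6, 9, NA)
)

# ave() applies the helper within each colony; it returns 0/1, hence "== 1"
keep   <- ave(toy$diam, toy$colony, FUN = keep_first_growth) == 1
result <- toy[keep, ]
print(result)
```

Colony 2 is dropped because it was above the range before entering it, colony 5 keeps steps 2 and 3, and colony 9 keeps step 3 together with the NA row at step 4, so the not-found record remains available for survival calculations.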