Subset by selecting only increasing values in previous rows greater than a certain number

Question

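What I want to do seems simple, but after 2 days of searching I decided to post my first question here to see if anyone can help.

I have a data frame (df) with 5 variables and 250,000 rows. Example: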

            date.time       Lat      Lon    Depth   ms
 1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5  0.391
 2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5  0.225
 3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5  0.091
 4: 2015-11-23 10:51:00 -3.233221 135.0632 400.5  0.0916
 5: 2015-11-23 10:52:00 -3.233225 135.0632 300.5  0.0333
 6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5  0.3833
 7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
 8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0  0.3166
 9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5  0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5  0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583
13: 2016-10-15 00:39:30 -3.349550 135.0998 400.5  0.0583
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5  0.0583
15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5  0.0583

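I want to select the previous n rows (as long as they are in increasing order, i.e. 100, 200, 300, 400, 500, 600 rather than 100, 200, 400, 100, 50) leading up to a depth value > 500 m, through to the maximum value over 500 m (avoiding repeats of the same data). I want each of these rows to appear in full in a new data frame (newdf):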

             date.time       Lat      Lon    Depth   ms
 1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5  0.391
 2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5  0.225
 **3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5  0.091**
 6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5  0.3833
 7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
 8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0  0.3166
 9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5  0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5  0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
**12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583**
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5  0.0583
**15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5  0.0583**

I have tried the following code:

which_max <- which(df$Depth >= 500)
encoding <- rle(diff(df$Depth) > 0) 

# these contain the start/end indices of all continuously increasing/decreasing subsets
ends <- cumsum(encoding$lengths) + 1L
starts <- ends - encoding$lengths

# filter out the decreasing subsets
starts <- starts[encoding$values]
ends <- ends[encoding$values]

# find the one that contains the maximum
interval <- which(starts <= which_max & ends >= which_max)
out <- df[starts[interval]:ends[interval],] #picks only selected interval to print

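It is based on a previous Stack Overflow post, but it only prints one group of highest values from my dataset, rather than every such group from the original (df):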

            date.time      Lat      Lon     Depth   ms
 1: 2016-05-11 23:44:30 1.769763 136.6246  102.0 0.600
 2: 2016-05-11 23:53:30 1.773071 136.6247  108.0 0.7250
 3: 2016-05-11 23:54:30 1.773439 136.6247  193.0 1.4166
 4: 2016-05-11 23:55:30 1.773806 136.6248  281.5 1.475
 5: 2016-05-11 23:56:30 1.774174 136.6248  364.5 1.383
 6: 2016-05-11 23:57:30 1.774542 136.6248  447.0 1.3750
 7: 2016-05-11 23:58:30 1.774910 136.6248  528.0 1.350
 8: 2016-05-11 23:59:30 1.775278 136.6248  609.5 1.358
 9: 2016-05-12 00:00:30 1.775646 136.6248  690.0 1.3416
10: 2016-05-12 00:01:30 1.776013 136.6249  770.0 1.33333

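I assume I need some kind of loop (?), but I am new to coding and unsure how to go about it.

Edit: I have also tried using lag, but it does not handle needing multiple increasing rows, or multiple rows > 500 m (i.e. 500, 550, 600, 700...).

I have also used: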

df$selecteddepth <- df$Depth * (c(0, diff(df$Depth)) >= 10)

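which selects depths whose difference from the previous row is at least 10 (meaning they are always increasing), but it does not address selecting depths over 500 m or removing duplicates.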
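To make the behaviour of that expression concrete, here is a minimal sketch on the first five Depth values from the example table. The names `depth`, `step` and the standalone `selecteddepth` vector are only for illustration; the expression itself is the one shown above.

```r
# First five Depth values from the example table in the question.
depth <- c(100.5, 300.5, 500.5, 400.5, 300.5)

# Change from the previous row; the first row has no predecessor, so use 0.
step <- c(0, diff(depth))

# Multiplying by the logical mask keeps a depth where the step is >= 10
# and turns every other depth into 0.
selecteddepth <- depth * (step >= 10)
selecteddepth
# values: 0, 300.5, 500.5, 0, 0
```

Note that the first row of the climb (100.5) is zeroed out as well, because its step is set to 0, and nothing in the expression refers to the 500 m threshold.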

Here is a subset using dput():

structure(list(date.time = structure(c(1450574990, 1450575050, 
1450575110, 1450575170, 1450575230, 1450575290, 1450575350, 1450575410, 
1450575470, 1450575530, 1450575590, 1450575650, 1450575710, 1450575770, 
1450575830, 1450575890), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Lat = c(-3.24669178745284, -3.24667124000555, -3.24665068714376, 
    -3.24663012886971, -3.24660956518562, -3.24658899609375, 
    -3.24656842159633, -3.24654784169558, -3.24652725639375, 
    -3.24650666569307, -3.24648606959577, -3.24646546810409, 
    -3.24644486122025, -3.24642424894649, -3.24640363128504, 
    -3.24638300823813), Lon = c(135.085169407522, 135.085165930176, 
    135.085162450626, 135.085158968873, 135.085155484919, 135.085151998764, 
    135.085148510411, 135.085145019861, 135.085141527116, 135.085138032177, 
    135.085134535045, 135.085131035722, 135.08512753421, 135.08512403051, 
    135.085120524624, 135.085117016552), Depth = c(373, 453, 
    500, 515.5, 521, 526.5, 512, 517.5, 522.5, 504, 522.5, 508.5, 
    481.5, 480, 474, 453), ms = c(1.60833333333333, 1.33333333333333, 
    0.783333333333333, 0.258333333333333, 0.0916666666666667, 
    0.0916666666666667, -0.241666666666667, 0.0916666666666667, 
    0.0833333333333333, -0.308333333333333, 0.308333333333333, 
    -0.233333333333333, -0.45, -0.025, -0.1, -0.35)), row.names = c(NA, 
-16L), class = c("data.table", "data.frame"))

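Edit 20-8-21 for det

Current output:

As you can see, the output is in descending order (970 > 929.5 > 888 > 851.5..., where ms is positive). I need the same idea as shown above but in ascending order, so that it looks like this (made-up data: 500 > 545 > 600 > 700), and ms should be negative (most of the time), because the animal is diving (negative velocity). So I need the top depth returned to be smaller than the ones that follow. I hope this clarifies it!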

Answer 1

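This needs to be done iteratively, because the values keep changing. The idea is to find the first position where the value is greater than the current threshold, then use that position to find the rows that belong to that group, and update the parameters for the next pass.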

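cur_start <- 1      # first row of the stretch currently being searched
cur_value <- 500L   # depth threshold; raised to each deeper value found
x <- df$Depth
l <- list()         # row indices of each group found
i <- 1

repeat{

  # stop once every row has been scanned
  if(cur_start > length(x)) break

  # offset (relative to cur_start) of the first depth above the threshold
  first_greater <- which(x[cur_start:length(x)] > cur_value)[1]

  if(is.na(first_greater)){

    # no remaining depth exceeds the threshold
    break

  } else if(first_greater == 1){

    # the stretch starts above the threshold, so no rows precede it; move on
    cur_start <- cur_start + 1
    next
  }

  pos_greater <- cur_start - 1 + first_greater  # absolute row index
  cur_value <- x[[pos_greater]]                 # new threshold

  # walking backwards from pos_greater: TRUE while depth keeps decreasing,
  # i.e. while the rows are increasing in forward order
  res <- diff(x[pos_greater:cur_start]) < 0

  if(all(res)){

    # the whole stretch before pos_greater is increasing
    l[[i]] <- cur_start:(pos_greater - 1)

  } else {

    # keep only the increasing rows immediately before pos_greater
    l[[i]] <-  rev(pos_greater - seq_len(which.min(res) - 1))
  }

  cur_start <- pos_greater + 1
  i <- i + 1
}

# one data frame per group of rows
lapply(l, function(x) df[x,])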
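For comparison, here is a rough vectorized sketch of a different approach (not the loop above). It labels each strictly increasing run with `cumsum()` and keeps whole runs whose deepest value exceeds the threshold. The `row` column, the fixed `threshold` of 500 and the requirement that a run has at least 2 rows are assumptions made for illustration. Unlike the loop, it keeps the threshold fixed and includes the row that exceeds it, which appears to match the bolded rows in the question's desired output.

```r
# Depth column from the 15-row example in the question.
depth <- c(100.5, 300.5, 500.5, 400.5, 300.5, 100.5, 200.0, 220.0,
           300.5, 500.5, 550.5, 600.0, 400.5, 400.5, 600.5)
df <- data.frame(row = seq_along(depth), Depth = depth)  # `row` is illustrative

threshold <- 500  # assumed cut-off in metres

# A new run starts at row 1 and wherever depth fails to increase.
run_id <- cumsum(c(TRUE, diff(df$Depth) <= 0))

# For every row: maximum depth and length of the run it belongs to.
run_max <- ave(df$Depth, run_id, FUN = max)
run_len <- ave(df$Depth, run_id, FUN = length)

# Keep whole runs that actually climb (>= 2 rows) and end above the threshold.
keep <- run_max > threshold & run_len >= 2
newdf <- df[keep, ]
newdf$row
# rows kept: 1 2 3 6 7 8 9 10 11 12 14 15
```

On the real data the same `keep` vector can be used as `df[keep, ]` to retain all columns; dropping the `run_len >= 2` condition would also keep isolated single rows deeper than the threshold.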

diff

loops

r

subset

cumsum