Subset by selecting only increasing values in previous rows greater than a certain number
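What I'm trying to do seems simple, but after 2 days of searching I've decided to post my first question here to see if anyone can help.
I have a data frame (df) with 5 variables and 250,000 rows.
Example: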
date.time Lat Lon Depth ms
1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5 0.391
2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5 0.225
3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5 0.091
4: 2015-11-23 10:51:00 -3.233221 135.0632 400.5 0.0916
5: 2015-11-23 10:52:00 -3.233225 135.0632 300.5 0.0333
6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5 0.3833
7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0 0.3166
9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5 0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5 0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583
13: 2016-10-15 00:39:30 -3.349550 135.0998 400.5 0.0583
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5 0.0583
15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5 0.0583
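I would like to select the first n rows, where n is determined by whether the depths are in increasing order (i.e. 100, 200, 300, 400, 500, 600 and not 100, 200, 400, 100, 50), leading up to a depth value > 500 m and ending at the maximum value above 500 m (to avoid repeating the same data).
I would like each of these rows to appear in full in a new data frame (newdf):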
date.time Lat Lon Depth ms
1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5 0.391
2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5 0.225
**3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5 0.091**
6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5 0.3833
7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0 0.3166
9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5 0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5 0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
**12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583**
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5 0.0583
**15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5 0.0583**
I have tried the following code:
which_max <- which(df$Depth >= 500)
encoding <- rle(diff(df$Depth) > 0)
# these contain the start/end indices of all continuously increasing/decreasing subsets
ends <- cumsum(encoding$lengths) + 1L
starts <- ends - encoding$lengths
# filter out the decreasing subsets
starts <- starts[encoding$values]
ends <- ends[encoding$values]
# find the one that contains the maximum
interval <- which(starts <= which_max & ends >= which_max)
out <- df[starts[interval]:ends[interval],] #picks only selected interval to print
This is based on a previous Stack post, but it only prints one set of the highest values from my dataset, rather than each one from the original (df):
date.time Lat Lon Depth ms
1: 2016-05-11 23:44:30 1.769763 136.6246 102.0 0.600
2: 2016-05-11 23:53:30 1.773071 136.6247 108.0 0.7250
3: 2016-05-11 23:54:30 1.773439 136.6247 193.0 1.4166
4: 2016-05-11 23:55:30 1.773806 136.6248 281.5 1.475
5: 2016-05-11 23:56:30 1.774174 136.6248 364.5 1.383
6: 2016-05-11 23:57:30 1.774542 136.6248 447.0 1.3750
7: 2016-05-11 23:58:30 1.774910 136.6248 528.0 1.350
8: 2016-05-11 23:59:30 1.775278 136.6248 609.5 1.358
9: 2016-05-12 00:00:30 1.775646 136.6248 690.0 1.3416
10: 2016-05-12 00:01:30 1.776013 136.6249 770.0 1.33333
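I assume I need to use some type of loop (?), but I am new to coding and unsure how to go about it.
EDIT: I have also tried using "lag", but it doesn't handle the need for multiple increasing rows, nor multiple rows > 500 m (i.e. 500, 550, 600, 700...).
I have also used: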
df$selecteddepth <- df$Depth * (c(0, diff(df$Depth)) >= 10)
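which selects depths that differ from the previous row by at least 10 (meaning they are always increasing), but it doesn't address selecting depths over 500 m or removing duplicates.
Here is a subset using dput():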
structure(list(date.time = structure(c(1450574990, 1450575050,
1450575110, 1450575170, 1450575230, 1450575290, 1450575350, 1450575410,
1450575470, 1450575530, 1450575590, 1450575650, 1450575710, 1450575770,
1450575830, 1450575890), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Lat = c(-3.24669178745284, -3.24667124000555, -3.24665068714376,
-3.24663012886971, -3.24660956518562, -3.24658899609375,
-3.24656842159633, -3.24654784169558, -3.24652725639375,
-3.24650666569307, -3.24648606959577, -3.24646546810409,
-3.24644486122025, -3.24642424894649, -3.24640363128504,
-3.24638300823813), Lon = c(135.085169407522, 135.085165930176,
135.085162450626, 135.085158968873, 135.085155484919, 135.085151998764,
135.085148510411, 135.085145019861, 135.085141527116, 135.085138032177,
135.085134535045, 135.085131035722, 135.08512753421, 135.08512403051,
135.085120524624, 135.085117016552), Depth = c(373, 453,
500, 515.5, 521, 526.5, 512, 517.5, 522.5, 504, 522.5, 508.5,
481.5, 480, 474, 453), ms = c(1.60833333333333, 1.33333333333333,
0.783333333333333, 0.258333333333333, 0.0916666666666667,
0.0916666666666667, -0.241666666666667, 0.0916666666666667,
0.0833333333333333, -0.308333333333333, 0.308333333333333,
-0.233333333333333, -0.45, -0.025, -0.1, -0.35)), row.names = c(NA,
-16L), class = c("data.table", "data.frame"))
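EDIT 20-8-21 for det
Current output:
As you can see, the output is in descending order (970 > 929.5 > 888 > 851.5..., where ms is positive). I need the same idea as shown above but in ascending order, so that it looks like this (made-up data: 500 > 545 > 600 > 700), and ms should be negative (most of the time) because the animal is diving (negative velocity). So I need the top number returned for depth to be smaller than the numbers that follow. I hope this clarifies it!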
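This needs to be done iteratively because the values change. The idea is to find the first position where the value is greater, then use that position to find the rows that belong to that group, and update the parameters needed for the next pass.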
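cur_start <- 1      # first row of the window still to be searched
cur_value <- 500L   # depth threshold; raised each time a deeper value is found
x <- df$Depth
l <- list()         # one vector of row indices per group
i <- 1
repeat {
  if (cur_start > length(x)) break
  # relative index of the first depth above the current threshold
  first_greater <- which(x[cur_start:length(x)] > cur_value)[1]
  if (is.na(first_greater)) {
    break                        # no deeper value remains
  } else if (first_greater == 1) {
    cur_start <- cur_start + 1   # window already starts above the threshold
    next
  }
  pos_greater <- cur_start - 1 + first_greater   # absolute row position
  cur_value <- x[[pos_greater]]
  # walking backwards from pos_greater: TRUE while each earlier depth is smaller
  res <- diff(x[pos_greater:cur_start]) < 0
  if (all(res)) {
    l[[i]] <- cur_start:(pos_greater - 1)
  } else {
    # keep only the increasing rows immediately before pos_greater
    l[[i]] <- rev(pos_greater - seq_len(which.min(res) - 1))
  }
  cur_start <- pos_greater + 1
  i <- i + 1
}
lapply(l, function(x) df[x,])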
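For comparison, here is a minimal vectorised sketch of a different reading of the question, not the loop above: split the depths into runs of strictly increasing values and keep every run whose maximum exceeds 500 m. The names df_demo, run_id and run_max are made up for illustration, the depths are copied from the 15-row example in the question, and the "strictly increasing" and "> 500" rules are assumptions about what is wanted.

```r
# Depths copied from the 15-row example in the question (other columns omitted).
depth <- c(100.5, 300.5, 500.5, 400.5, 300.5, 100.5, 200.0, 220.0,
           300.5, 500.5, 550.5, 600.0, 400.5, 400.5, 600.5)
df_demo <- data.frame(row = seq_along(depth), Depth = depth)

# Assumption: a new run starts at row 1 and wherever depth fails to increase
# (a tie counts as a break, which also separates duplicated rows).
run_id <- cumsum(c(TRUE, diff(df_demo$Depth) <= 0))

# Maximum depth of each run, repeated for every row of that run.
run_max <- ave(df_demo$Depth, run_id, FUN = max)

# Assumption: keep whole runs whose deepest value is above 500 m.
newdf <- df_demo[run_max > 500, ]
print(newdf$row)
```

On this example the kept rows are 1, 2, 3, 6 to 12, 14 and 15, which matches the newdf shown in the question. Note that a run consisting of a single row deeper than 500 m would also be kept; if that is not wanted, runs could additionally be filtered by their length. To get one data frame per dive, something like split(newdf, run_id[run_max > 500]) should work.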