Select only observations in a df that are minimally a half year apart, starting from the last observation

Question

Introduction to my problem

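I have a data frame with an unequal number of observations per person, and I want to include only observations that are more than half a year apart.
I want to select each person's last observation first, and then select the next observation that is at least half a year earlier.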

Example df

ID     Date        Var1 ... Var12
100    13/02/2012    x       x
100    14/09/2012    x       x
100    31/01/2013    x       x
100    18/12/2012    x       x
101    29/04/2012    x       x
102    01/11/2012    x       x
103    12/08/2012    x       x
103    22/08/2013    x       x
103    26/09/2013    x       x
103    22/01/2014    x       x
104    19/01/2012    x       x 
104    17/02/2014    x       x
104    15/03/2014    x       x
104    12/05/2015    x       x

After selecting the correct observations, the expected df should look like this:

ID     Date        Var1 ... Var12
100    13/02/2012    x       x
100    14/09/2012    x       x
100    18/12/2013    x       x
101    29/04/2012    x       x
102    01/11/2012    x       x
103    12/08/2012    x       x
103    22/08/2013    x       x
103    22/01/2014    x       x
104    19/01/2012    x       x 
104    17/02/2014    x       x
104    12/05/2015    x       x

What I have tried

I tried to write a loop, but I could not handle the selection problem. Thanks in advance for any suggestions.

Answer 1

This is ugly, but it seems to work. Load some things:

library(data.table)
library(plyr)

dt <- fread("ID     Date        Var1 Var12
100    13/02/2012    x       x
100    14/09/2012    x       x
100    31/01/2013    x       x
100    18/12/2012    x       x
101    29/04/2012    x       x
102    01/11/2012    x       x
103    12/08/2012    x       x
103    22/08/2013    x       x
103    26/09/2013    x       x
103    22/01/2014    x       x
104    19/01/2012    x       x 
104    17/02/2014    x       x
104    15/03/2014    x       x
104    12/05/2015    x       x")

df <- as.data.frame(dt)
df$Date <- as.Date(df$Date, format="%d/%m/%Y")

Work the diff magic. Note that you can change the threshold here to whatever you like.

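threshold <- 180  # minimum gap in days (roughly half a year)
ddply(df, .(ID), function(x) {
    # newest observation first
    x <- x[order(x[,2], decreasing=TRUE),]
    # gaps to the previous (newer) row that exceed the threshold on their own
    sel <- diff(x[,2]) < -threshold
    sel2 <- diff(x[,2])
    # for the remaining gaps, use the running distance from the newest row
    sel2[!sel] <- cumsum(as.numeric(diff(x[,2])))[!sel]

    # always keep the newest row, plus every row flagged above
    x[c(1,which(sel2 < -threshold)+1),]
})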
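
The threshold above is a fixed number of days, so 180 only approximates half a year. If a calendar half year is what is wanted, one possible sketch is to compute a cutoff date with `seq()` on a `Date`. The helper name `half_year_before` is hypothetical and not part of the answer's code.

```r
# Hypothetical helper: the date six calendar months before `d`.
# seq() on a Date accepts by = "-6 months"; month-end dates can roll over
# into the following month when the target month is shorter.
half_year_before <- function(d) seq(d, by = "-6 months", length.out = 2)[2]

# Last observation of ID 100 in the example data
last_obs <- as.Date("31/01/2013", format = "%d/%m/%Y")
cutoff <- half_year_before(last_obs)

# An earlier observation qualifies if it falls on or before the cutoff
candidates <- as.Date(c("18/12/2012", "14/09/2012", "13/02/2012"),
                      format = "%d/%m/%Y")
candidates <= cutoff
```

Whether a day count or a calendar cutoff is the better fit depends on how strictly "half a year" should be read.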

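The ugly mess of diff and cumsum does the following for each ID:

Sort by date in descending order
Compute the date differences between consecutive observations, and flag those that exceed the threshold
Replace the date differences of the unflagged observations with the cumulative sum
Grab the ones that are now flagged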
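
To make the steps above concrete, here is a small walk-through of the intermediate values for a single ID, using the four dates of ID 103 from the example data. It only uses base R and mirrors the body of the function passed to ddply; the variable names `d` and `gaps` are introduced here for illustration.

```r
threshold <- 180

# Dates of ID 103, sorted newest first
d <- as.Date(c("12/08/2012", "22/08/2013", "26/09/2013", "22/01/2014"),
             format = "%d/%m/%Y")
d <- d[order(d, decreasing = TRUE)]

# Gaps in days between consecutive rows (negative, because dates descend)
gaps <- as.numeric(diff(d))          # -118  -35 -375

# Gaps that exceed the threshold on their own
sel <- gaps < -threshold             # FALSE FALSE TRUE

# Unflagged gaps are replaced by the running distance from the newest date
sel2 <- gaps
sel2[!sel] <- cumsum(gaps)[!sel]     # -118 -153 -375

# Keep the newest row plus every row whose value exceeds the threshold
kept <- d[c(1, which(sel2 < -threshold) + 1)]
kept
```

For this ID only the newest date and the oldest date survive, because the two dates in between are both within 180 days of 2014-01-22.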

And voilà, the result of the ddply call:

    ID       Date Var1 Var12
1  100 2013-01-31    x     x
2  100 2012-02-13    x     x
3  101 2012-04-29    x     x
4  102 2012-11-01    x     x
5  103 2014-01-22    x     x
6  103 2012-08-12    x     x
7  104 2015-05-12    x     x
8  104 2014-03-15    x     x
9  104 2014-02-17    x     x
10 104 2012-01-19    x     x
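
Note that in the output above, ID 104 keeps both 2014-03-15 and 2014-02-17, which are only 26 days apart. The cumulative sum is measured from the newest row and is not reset after a row has been kept. If every kept observation should be at least the threshold away from the previously kept one, a plain loop per ID is one possible alternative. The sketch below is not the answer's method: the function name `keep_spaced` is hypothetical, and treating a gap of exactly `threshold` days as acceptable (`>=`) is an assumption based on the "at least half a year" wording of the question.

```r
# Hypothetical helper: returns a logical vector marking the dates to keep.
# Walks from the newest date backwards and keeps a date only when it is at
# least `threshold` days older than the most recently kept date.
keep_spaced <- function(dates, threshold = 180) {
  keep <- logical(length(dates))
  last_kept <- NULL
  for (i in order(dates, decreasing = TRUE)) {
    if (is.null(last_kept) ||
        as.numeric(last_kept - dates[i]) >= threshold) {
      keep[i] <- TRUE
      last_kept <- dates[i]
    }
  }
  keep
}

# ID and Date columns of the example data
df <- data.frame(
  ID = c(100, 100, 100, 100, 101, 102, 103, 103, 103, 103,
         104, 104, 104, 104),
  Date = as.Date(c("13/02/2012", "14/09/2012", "31/01/2013", "18/12/2012",
                   "29/04/2012", "01/11/2012", "12/08/2012", "22/08/2013",
                   "26/09/2013", "22/01/2014", "19/01/2012", "17/02/2014",
                   "15/03/2014", "12/05/2015"), format = "%d/%m/%Y")
)

# Apply the helper per ID and stack the kept rows (original row order is kept)
result <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  x[keep_spaced(x$Date), ]
}))
result
```

With the example data this keeps 2012-01-19, 2014-03-15 and 2015-05-12 for ID 104 and drops 2014-02-17; the other IDs give the same rows as the ddply version.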
