Lookback with NA handling

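I have some data that includes a date, an ID, and a value. I want to add a column called "bad_perf" that, by ID, looks at the current record's value and the values of the previous two records, and assigns 1 if all three are less than or equal to 10. If the current value is NA, assign 0; if either of the previous two values is NA, assign 0. If there are not enough earlier records, assign 0.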

Here is the data:

library(lubridate)  # provides mdy()

asof_dt <- mdy("11/14/2014","11/21/2014","11/28/2014","12/5/2014","4/25/2014","5/2/2014","5/9/2014","5/16/2014","5/23/2014","5/30/2014","6/6/2014")
id <- c("ABC","ABC","ABC","ABC","XYZ","XYZ","XYZ","XYZ","XYZ","XYZ","XYZ")
value <- c(7,8,3,10,11,10,1,NA,9,3,10)
df <- data.frame(asof_dt, id, value)


> df
     asof_dt  id value
1  2014-11-14 ABC     7
2  2014-11-21 ABC     8
3  2014-11-28 ABC     3
4  2014-12-05 ABC    10
5  2014-04-25 XYZ    11
6  2014-05-02 XYZ    10
7  2014-05-09 XYZ     1
8  2014-05-16 XYZ    NA
9  2014-05-23 XYZ     9
10 2014-05-30 XYZ     3
11 2014-06-06 XYZ    10

Here is the result I expect, with comments that I hope make it clearer.

asof_dt     id   value  bad_perf  Comment
11/14/2014  ABC  7      0         Assigned 0; not enough data
11/21/2014  ABC  8      0         Assigned 0; not enough data
11/28/2014  ABC  3      1         Assigned 1; this record and the previous 2 records are less than or equal to 10
12/5/2014   ABC  10     1         Assigned 1; this record and the previous 2 records are less than or equal to 10
4/25/2014   XYZ  11     0         Assigned 0; not enough data
5/2/2014    XYZ  10     0         Assigned 0; not enough data
5/9/2014    XYZ  1      0         Assigned 0; previous 2 records are not all less than or equal to 10
5/16/2014   XYZ  NA     0         Assigned 0; current value is NA
5/23/2014   XYZ  9      0         Assigned 0; at least 1 NA
5/30/2014   XYZ  3      0         Assigned 0; at least 1 NA
6/6/2014    XYZ  10     1         Assigned 1; this record and the previous 2 records are less than or equal to 10

Unfortunately, I'm not sure how to start. I'm currently doing this step in Excel!

Thanks very much!

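You could try a base R approach: split the "value" column by "id", then use embed to create the lags. Then check whether all the elements in each row are less than or equal to 10 (the rowSums(...) == 3 test; with na.rm=TRUE, a row containing an NA can never reach 3), and unlist to get the result as a single vector. This assumes the rows are sorted by id, and by date within each id.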

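df$bad_perf <- unlist(lapply(split(df$value, df$id), function(x) {
    # pad with two NAs so the first two records of each id get NA lags
    x1 <- embed(c(rep(NA, 2), x), 2)
    # columns: current value, lag 1, lag 2; flag rows where all three are <= 10
    as.numeric(rowSums(cbind(x, x1[-nrow(x1), , drop = FALSE]) <= 10, na.rm = TRUE) == 3)
}), use.names = FALSE)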
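
To see what the embed step builds, here is a minimal sketch on the four ABC values alone. The names x, x1, lags, and flags are only for illustration.

```r
# Values for a single id (ABC), already in date order
x <- c(7, 8, 3, 10)

# Pad with two NAs, then embed in dimension 2:
# row i of x1 holds (previous value, value before that), plus one extra trailing row
x1 <- embed(c(rep(NA, 2), x), 2)

# Drop the extra row and bind the current values in front
lags <- cbind(x, x1[-nrow(x1), , drop = FALSE])
# rows of lags: (7, NA, NA), (8, 7, NA), (3, 8, 7), (10, 3, 8)

# A row sums to 3 only if all three entries are non-NA and <= 10
flags <- as.numeric(rowSums(lags <= 10, na.rm = TRUE) == 3)
flags
# 0 0 1 1
```

The first two rows can never be flagged because the padding NAs are dropped by na.rm=TRUE and the row sum stays below 3, which is how the "not enough data" case is handled.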

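Or you can use data.table, whose shift function (introduced in the development version 1.9.5 and available in releases from 1.9.6) creates the "lag" columns; then apply the same rowSums step as before.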

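library(data.table) #data.table_1.9.5
df1 <- copy(df)  # df1 stays a plain data.frame; setDT() below converts df by reference
# shift(value, n=0:2) returns the value itself plus its first and second lags, per id
df1$bad_perf <- setDT(df)[, shift(value, n=0:2L), by=id][,
                 (rowSums(.SD<=10, na.rm=TRUE)==3)+0L, .SDcols=2:4]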
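
For reference, a small sketch of what shift returns when given a vector of offsets, assuming data.table 1.9.6 or later is installed. The variable names are illustrative only.

```r
library(data.table)

# Values for a single id (ABC), already in date order
x <- c(7, 8, 3, 10)

# With several offsets, shift() returns a list with one vector per offset:
# offset 0 is x itself, offsets 1 and 2 are x lagged and padded with NA
lagged <- shift(x, n = 0:2)

# Bind the list into a matrix (current, lag 1, lag 2) and apply the same test
m <- do.call(cbind, lagged)
flags <- (rowSums(m <= 10, na.rm = TRUE) == 3) + 0L
flags
# 0 0 1 1
```

Inside a grouped call, each list element becomes a column after the grouping column, which is why the three value columns sit at positions 2 to 4.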

Or, using dplyr, you can create the lag columns with lag.

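library(dplyr)

# add the two lagged values within each id, then drop the grouping
df1 <- df %>%
          group_by(id) %>%
          mutate(value1=lag(value), value2=lag(value, 2L)) %>%
          ungroup()

# select the three value columns by name, so extra columns in df cannot shift them
df$bad_perf <- (rowSums(df1[c("value", "value1", "value2")]<=10, na.rm=TRUE)==3)+0
df$bad_perf
#[1] 0 0 1 1 0 0 0 0 0 0 1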
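
A further base R sketch, not taken from the answers above: ave() applies a function within each id and writes the results back in the original row order, so the data only needs to be in date order within each id. The helper flag_bad_perf and its limit argument are hypothetical names introduced here for illustration.

```r
# Rebuild the example data (as.Date is used here instead of lubridate::mdy)
df <- data.frame(
  asof_dt = as.Date(c("2014-11-14", "2014-11-21", "2014-11-28", "2014-12-05",
                      "2014-04-25", "2014-05-02", "2014-05-09", "2014-05-16",
                      "2014-05-23", "2014-05-30", "2014-06-06")),
  id = c(rep("ABC", 4), rep("XYZ", 7)),
  value = c(7, 8, 3, 10, 11, 10, 1, NA, 9, 3, 10)
)

# Hypothetical helper: 1 when a record and the two records before it
# (within the same group) are all non-NA and <= limit, otherwise 0
flag_bad_perf <- function(x, limit = 10) {
  ok <- !is.na(x) & x <= limit
  # shift ok down by k positions, filling the first k slots with FALSE
  lag_ok <- function(k) c(rep(FALSE, k), ok)[seq_along(ok)]
  as.integer(ok & lag_ok(1) & lag_ok(2))
}

# Apply the helper per id; results come back in the original row order
df$bad_perf <- ave(df$value, df$id, FUN = flag_bad_perf)
df$bad_perf
# 0 0 1 1 0 0 0 0 0 0 1
```

Filling the missing lags with FALSE rather than NA means the "not enough data" and "NA in the window" cases both fall out of the same logical AND.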