Lookback with NA handling
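I have some data containing dates, IDs, and values. I want to add a column called "bad_perf" that looks, by ID, at the values for today and the previous two days (that is, the current record and the two records before it), then assigns 1 when all three values are less than or equal to 10. If today's value is NA, assign 0; if either of the previous two values is NA, assign 0. If there are not enough earlier records, assign 0.
Here is the data: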
library(lubridate)  # provides mdy()
asof_dt <- mdy("11/14/2014","11/21/2014","11/28/2014","12/5/2014","4/25/2014","5/2/2014","5/9/2014","5/16/2014","5/23/2014","5/30/2014","6/6/2014")
id <- c("ABC","ABC","ABC","ABC","XYZ","XYZ","XYZ","XYZ","XYZ","XYZ","XYZ")
value <- c(7,8,3,10,11,10,1,NA,9,3,10)
df <- data.frame(asof_dt, id, value)
> df
      asof_dt  id value
1  2014-11-14 ABC     7
2  2014-11-21 ABC     8
3  2014-11-28 ABC     3
4  2014-12-05 ABC    10
5  2014-04-25 XYZ    11
6  2014-05-02 XYZ    10
7  2014-05-09 XYZ     1
8  2014-05-16 XYZ    NA
9  2014-05-23 XYZ     9
10 2014-05-30 XYZ     3
11 2014-06-06 XYZ    10
Here is my desired result, with comments that I hope make things clearer.
asof_dt     id   value  bad_perf  Comment
11/14/2014  ABC  7      0         Assigned 0; not enough data
11/21/2014  ABC  8      0         Assigned 0; not enough data
11/28/2014  ABC  3      1         Assigned 1; this record and the previous 2 records are less than or equal to 10
12/5/2014   ABC  10     1         Assigned 1; this record and the previous 2 records are less than or equal to 10
4/25/2014   XYZ  11     0         Assigned 0; not enough data
5/2/2014    XYZ  10     0         Assigned 0; not enough data
5/9/2014    XYZ  1      0         Assigned 0; previous 2 records are not less than or equal to 10
5/16/2014   XYZ  NA     0         Assigned 0; current value is NA
5/23/2014   XYZ  9      0         Assigned 0; at least 1 NA
5/30/2014   XYZ  3      0         Assigned 0; at least 1 NA
6/6/2014    XYZ  10     1         Assigned 1; this record and the previous 2 records are less than or equal to 10
Unfortunately, I'm not sure how to begin. I'm currently doing this step in Excel!
Thanks very much!
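You could try a base R approach: use embed to create the "lags" after splitting the "value" column by "id". Then check whether all the elements in each row are less than or equal to 10 (rowSums(...)), and unlist to get the result as a single vector.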
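# Assumes df is sorted by id, so unlist() returns the groups in row order
df$bad_perf <- unlist(lapply(split(df$value, df$id), function(x) {
  # pad with two NAs, then build (lag 1, lag 2) pairs for each record
  x1 <- embed(c(rep(NA, 2), x), 2)
  # count how many of (current, lag 1, lag 2) are <= 10; NAs never count
  as.numeric(rowSums(cbind(x, x1[-nrow(x1), , drop = FALSE]) <= 10, na.rm = TRUE) == 3)
}), use.names = FALSE)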
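To see what embed is doing inside that function, here is a small sketch that runs the same steps on the "XYZ" values alone. The variable names lagged and flag are only illustrative and do not appear in the answer's code.

```r
# Values for id "XYZ" from the question's data
x <- c(11, 10, 1, NA, 9, 3, 10)

# Pad with two NAs so the first two records get (missing) lags,
# then build one row per position holding (previous value, value before that)
x1 <- embed(c(rep(NA, 2), x), 2)

# Drop the surplus last row and put the current value in front:
# the columns are now current value, lag 1, lag 2
lagged <- cbind(x, x1[-nrow(x1), , drop = FALSE])

# NA <= 10 is NA, and na.rm = TRUE skips it, so a row containing an NA
# can never reach a count of 3
flag <- as.numeric(rowSums(lagged <= 10, na.rm = TRUE) == 3)
flag
# 0 0 0 0 0 0 1
```

Only the last record has three non-missing values that are all at most 10, which matches the expected bad_perf column for "XYZ".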
Or you could use the development version of data.table, which introduces the function shift to get the "lag" columns, and then apply the same rowSums approach as before.
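library(data.table) # data.table_1.9.5
df1 <- as.data.table(df)  # work on a copy so df itself stays a plain data.frame
# shift(value, n=0:2L) returns the current value and its first two lags within each id
df1$bad_perf <- df1[, shift(value, n=0:2L), id][,
   (rowSums(.SD<=10, na.rm=TRUE)==3)+0L, .SDcols=2:4]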
Or, using dplyr, you can create the lag columns with lag.
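library(dplyr)
df1 <- df %>%
  group_by(id) %>%
  mutate(value1 = lag(value), value2 = lag(value, 2L)) %>%
  ungroup()
# select the value columns by name so extra columns in df cannot shift the positions
df$bad_perf <- (rowSums(df1[c("value", "value1", "value2")] <= 10, na.rm = TRUE) == 3) + 0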
df$bad_perf
#[1] 0 0 1 1 0 0 0 0 0 0 1
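The base R and data.table snippets above rely on the rows of each id being contiguous and in date order. As a rough sketch of an alternative that keeps the result aligned with the original rows even when ids are interleaved, base R's ave can apply a per-group function in place. The helper name lookback_flag and its arguments k and limit are hypothetical, and the dates are built with as.Date rather than mdy so the example needs no packages.

```r
# Data from the question; dates built with base as.Date() instead of lubridate::mdy()
df <- data.frame(
  asof_dt = as.Date(c("2014-11-14", "2014-11-21", "2014-11-28", "2014-12-05",
                      "2014-04-25", "2014-05-02", "2014-05-09", "2014-05-16",
                      "2014-05-23", "2014-05-30", "2014-06-06")),
  id = c("ABC", "ABC", "ABC", "ABC", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ", "XYZ"),
  value = c(7, 8, 3, 10, 11, 10, 1, NA, 9, 3, 10)
)

# Hypothetical helper: 1 when the current value and the (k - 1) values before it
# are all non-missing and <= limit, otherwise 0
lookback_flag <- function(x, k = 3, limit = 10) {
  ok <- !is.na(x) & x <= limit   # an NA counts as a failure
  n <- length(x)
  out <- integer(n)              # the first k - 1 records stay 0 (not enough data)
  if (n >= k) {
    for (i in k:n) out[i] <- as.integer(all(ok[(i - k + 1):i]))
  }
  out
}

# ave() applies the helper within each id and returns values in the original row order
df$bad_perf <- ave(df$value, df$id, FUN = lookback_flag)
df$bad_perf
# 0 0 1 1 0 0 0 0 0 0 1
```

The rows within each id still need to be in date order, since the helper looks back by position rather than by date.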