R: yes-no factor based on previous entries

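I have a time series dataset from a weather station. It has 3 columns: time - date and time; p - rainfall, mm; h - water level, m.

I need to create a new column factor_rain containing 1/0 values: 1 if the water level (df$h) was affected by rain (df$p). This may be the case if it rained during the previous 5 hours (5 entries). In all other cases it should be 0.

Part of the dataset is here: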

df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00",  "2017-06-04 17:00:00",
                            "2017-06-04 19:00:00",  "2017-06-04 21:00:00",  "2017-06-04 23:00:00",
                            "2017-06-05 9:00:00",   "2017-06-05 11:00:00",
                            "2017-06-05 13:00:00",  "2017-06-05 16:00:00",
                            "2017-06-05 19:00:00",  "2017-06-05 21:00:00",  "2017-06-05 23:00:00",
                            "2017-06-06 9:00:00",   "2017-06-06 11:00:00",  "2017-06-06 13:00:00",
                            "2017-06-06 16:00:00",  "2017-06-06 17:00:00",  "2017-06-06 18:00:00",
                            "2017-06-06 19:00:00"),
                   p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12, 
                         NA, NA, NA, NA, NA, NA, NA, NA, NA),
                   h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
                        33,NA,NA,NA,29,NA,NA,NA,NA))

I tried what I thought was the simplest approach, but unfortunately it only works in one case:

> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
  longer object length is not a multiple of shorter object length

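Is there any way to solve this? It would be great if you could suggest how to work with real time (e.g. something from the xts library). I mean using a 5-hour threshold rather than 5 values.

By the way, this is the result I need: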

> df
                  time    p  h factor_rain
1   2017-06-04 9:00:00   NA 23           0
2  2017-06-04 13:00:00   NA NA           0
3  2017-06-04 17:00:00 16.4 NA           0
4  2017-06-04 19:00:00   NA NA           0
5  2017-06-04 21:00:00   NA NA           0
6  2017-06-04 23:00:00   NA 32           1
7   2017-06-05 9:00:00   NA NA           0
8  2017-06-05 11:00:00   NA NA           0
9  2017-06-05 13:00:00   NA 28           0
10 2017-06-05 16:00:00   NA NA           0
11 2017-06-05 19:00:00 12.0 NA           0
12 2017-06-05 21:00:00   NA 33           1
13 2017-06-05 23:00:00   NA NA           0
14  2017-06-06 9:00:00   NA NA           0
15 2017-06-06 11:00:00   NA NA           0
16 2017-06-06 13:00:00   NA 29           0
17 2017-06-06 16:00:00   NA NA           0
18 2017-06-06 17:00:00   NA NA           0
19 2017-06-06 18:00:00   NA NA           0
20 2017-06-06 19:00:00   NA NA           0

You can use

df$factorrain = FALSE
df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE

#                   time    p  h factorrain
# 1   2017-06-04 9:00:00   NA 23   FALSE
# 2  2017-06-04 13:00:00   NA NA   FALSE
# 3  2017-06-04 17:00:00 16.4 NA    TRUE
# 4  2017-06-04 19:00:00   NA NA    TRUE
# 5  2017-06-04 21:00:00   NA NA    TRUE
# 6  2017-06-04 23:00:00   NA 32    TRUE
# 7   2017-06-05 9:00:00   NA NA    TRUE
# 8  2017-06-05 11:00:00   NA NA   FALSE
# 9  2017-06-05 13:00:00   NA 28   FALSE
# 10 2017-06-05 16:00:00   NA NA   FALSE
# 11 2017-06-05 19:00:00 12.0 NA    TRUE
# 12 2017-06-05 21:00:00   NA 33    TRUE
# 13 2017-06-05 23:00:00   NA NA    TRUE
# 14  2017-06-06 9:00:00   NA NA    TRUE
# 15 2017-06-06 11:00:00   NA NA    TRUE
# 16 2017-06-06 13:00:00   NA 29   FALSE
# 17 2017-06-06 16:00:00   NA NA   FALSE
# 18 2017-06-06 17:00:00   NA NA   FALSE
# 19 2017-06-06 18:00:00   NA NA   FALSE
# 20 2017-06-06 19:00:00   NA NA   FALSE

Alternatively, a similar approach using sapply,

df$factorrain = FALSE
df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
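
Both snippets above flag every row in the 5-entry window, whereas the output requested in the question flags only rows where a water level was actually recorded. A minimal sketch of one way to combine the two conditions, assuming the window means "the rain entry plus the next 4 entries" as above; `rain_idx`, `window_idx` and `in_window` are illustrative names, and the data frame repeats the p and h columns from the question:

```r
# p and h columns copied from the question (time is not needed for the entry-based rule)
df <- data.frame(
  p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
        NA, NA, NA, NA, NA, NA, NA, NA, NA),
  h = c(23, NA, NA, NA, NA, 32, NA, NA, 28, NA, NA,
        33, NA, NA, NA, 29, NA, NA, NA, NA)
)

# Rows where rain was recorded
rain_idx <- which(!is.na(df$p))

# Each rain row plus the 4 rows after it, clipped to the size of the data
window_idx <- unique(as.vector(outer(rain_idx, 0:4, "+")))
window_idx <- window_idx[window_idx <= nrow(df)]
in_window <- seq_len(nrow(df)) %in% window_idx

# 1 only when the row is inside a rain window AND has a water level reading
df$factor_rain <- as.integer(in_window & !is.na(df$h))

which(df$factor_rain == 1)
```

Clipping `window_idx` to `nrow(df)` guards against a rain entry that occurs within the last 4 rows, which would otherwise index past the end of the data frame.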

A solution can be achieved with a non-equi join from data.table.

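library(data.table)

# Parse the timestamps so that time arithmetic works
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S")

setDT(df)
# Lower bound of the 5-hour look-back window for each row
df[,timeLow := time-5*60*60]

# Self-join each row to the rows from the preceding 5 hours, flag the row when it has
# a water level and any joined row has rain, then join back to df to restore all rows
df[df,.(time, p, h = i.h), on=.(time < time, time >= timeLow)][
  ,.(factor_rain = ifelse(!is.na(first(h)), any(!is.na(p)),FALSE)),by=.(time)][
    df,.(time, p, h, factor_rain),on="time"]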

#                   time    p  h factor_rain
# 1: 2017-06-04 09:00:00   NA 23       FALSE
# 2: 2017-06-04 13:00:00   NA NA       FALSE
# 3: 2017-06-04 17:00:00 16.4 NA       FALSE
# 4: 2017-06-04 19:00:00   NA NA       FALSE
# 5: 2017-06-04 21:00:00   NA NA       FALSE
# 6: 2017-06-04 23:00:00   NA 32       FALSE   <-- There is no rain in last 5 hours
# 7: 2017-06-05 09:00:00   NA NA       FALSE
# 8: 2017-06-05 11:00:00   NA NA       FALSE
# 9: 2017-06-05 13:00:00   NA 28       FALSE
# 10: 2017-06-05 16:00:00   NA NA       FALSE
# 11: 2017-06-05 19:00:00 12.0 NA       FALSE
# 12: 2017-06-05 21:00:00   NA 33        TRUE
# 13: 2017-06-05 23:00:00   NA NA       FALSE
# 14: 2017-06-06 09:00:00   NA NA       FALSE
# 15: 2017-06-06 11:00:00   NA NA       FALSE
# 16: 2017-06-06 13:00:00   NA 29       FALSE
# 17: 2017-06-06 16:00:00   NA NA       FALSE
# 18: 2017-06-06 17:00:00   NA NA       FALSE
# 19: 2017-06-06 18:00:00   NA NA       FALSE
# 20: 2017-06-06 19:00:00   NA NA       FALSE

Note: the solution can be optimized a bit. I will do that when I get some time.
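
For comparison, the same 5-hour rule can be sketched in base R without a join. This is only an illustration under stated assumptions: the timestamps are parsed as UTC (the question does not give a time zone), rain counts when it was recorded strictly before the row and at most 5 hours earlier (mirroring the join conditions above), and `rain_times` and `rained_recently` are illustrative names.

```r
# Data copied from the question
df <- data.frame(
  time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
           "2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
           "2017-06-05 9:00:00", "2017-06-05 11:00:00",
           "2017-06-05 13:00:00", "2017-06-05 16:00:00",
           "2017-06-05 19:00:00", "2017-06-05 21:00:00", "2017-06-05 23:00:00",
           "2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
           "2017-06-06 16:00:00", "2017-06-06 17:00:00", "2017-06-06 18:00:00",
           "2017-06-06 19:00:00"),
  p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
        NA, NA, NA, NA, NA, NA, NA, NA, NA),
  h = c(23, NA, NA, NA, NA, 32, NA, NA, 28, NA, NA,
        33, NA, NA, NA, 29, NA, NA, NA, NA)
)

# Assumption: UTC, so that no daylight-saving shift distorts the hour differences
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Timestamps at which rain was recorded
rain_times <- df$time[!is.na(df$p)]

# For every row: was there rain strictly before it and no more than 5 hours earlier?
rained_recently <- vapply(seq_len(nrow(df)), function(i) {
  lag_hours <- as.numeric(difftime(df$time[i], rain_times, units = "hours"))
  any(lag_hours > 0 & lag_hours <= 5)
}, logical(1))

# 1 only when a water level was recorded and rain fell inside the window
df$factor_rain <- as.integer(rained_recently & !is.na(df$h))

which(df$factor_rain == 1)
```

As with the data.table version, the 23:00 reading on 2017-06-04 is not flagged, because the rain at 17:00 is 6 hours earlier and falls outside the 5-hour window.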