R filtering/selecting data by POSIXct time and a condition

Question

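I have measured temperatures on different urban tree species at a high temporal resolution of 10 minutes, and I want to compare their responses. I am therefore looking at particularly hot periods. The task I cannot get to work on my data set is selecting complete days based on a maximum value. For example, any day with at least one measurement above 30 °C should be extracted in full from my data frame. Below is a reproducible example that illustrates my problem:

In my Measurings data frame I have computed a column that indicates whether each individual measurement is above or below 30 °C. I would like to use that column to tell other functions whether a day should be selected for a New Dataframe. Whenever the value exceeds 30 °C at any time during a day, I want to include that whole day, from 00:00 to 23:59, in the New Dataframe for further analysis.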

start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")

Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35,1000, replace = TRUE),
  Variable1 = sample(1:200,1000, replace = TRUE),
  Variable2 = sample(300:800,1000, replace = TRUE)
)

Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")

Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")

The example generates a data frame similar in structure to my data:

head(Measurings)

                 Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00   28        56       377 normal             0
2 2018-05-18 01:00:00   23        65       408 normal             0
3 2018-05-18 02:00:00   29        78       324 normal             0
4 2018-05-18 03:00:00   24       157       432 normal             0
5 2018-05-18 04:00:00   32       129       794   heat             1
6 2018-05-18 05:00:00   25        27       574 normal             0

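So how do I subset the data to get a New Dataframe containing all days that have at least one entry marked as "heat"?

I know that dplyr::filter can filter individual entries (row 5 at the start of the example). But how do I tell it to take the whole day 2018-05-18?

I am fairly new to analysing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for many tasks, but I am open to anything that works.

Many thanks, Konrad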

Answer 1

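Here is one possible solution using the data set provided in the question. Note that it is not a great example, because every day probably includes at least one observation flagged as above 30 °C (that is, there are no days to filter out of this data set), but the code should do the job on your actual data.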

# import packages
library(dplyr)
library(stringr)

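# break the time stamp into Day and Hour
# (format() is called explicitly so the split does not depend on how
# the POSIXct values are coerced to character; as_data_frame() is deprecated)
time_df <- as.data.frame(
  str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE),
  stringsAsFactors = FALSE
)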

# name the columns
names(time_df) <- c("Day", "Hour")

# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])

# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])

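More precisely, you are creating a random sample of 1000 hourly observations, spanning about 42 days, with temperatures varying between 20 and 35. In your example it is therefore very likely that every day has at least one observation flagged as above 30 °C. Also, it is always good practice to set a seed to ensure reproducibility.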
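To make the point about seeds and test data concrete, here is a minimal sketch that is not part of the original example. It assumes a lower temperature range (10 to 31, chosen purely for illustration) so that a fair share of days should have no reading above 30; the seed value and the names `Measurings_seeded` and `day_summary` are hypothetical.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, so the random draws are repeatable

start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")

# Assumed range 10:31 makes readings above 30 rare (1 value in 22)
Measurings_seeded <- data.frame(
  Time = tseq,
  Temp = sample(10:31, 1000, replace = TRUE)
)

# One row per calendar day, flagging whether any reading exceeded 30
day_summary <- Measurings_seeded %>%
  mutate(Day = format(Time, "%Y-%m-%d")) %>%
  group_by(Day) %>%
  summarise(any_heat = any(Temp > 30))

# How many days would be kept (TRUE) or dropped (FALSE)
table(day_summary$any_heat)
```

With a data set like this, a filtering solution can be checked against days that really should be dropped, which the original 20 to 35 range almost never produces.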

Answer 2

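Create a variable that identifies the day (dropping hours, minutes, and so on). Then iterate over the unique dates and keep only the subsets in which heat30 contains "heat" at least once: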

library(dplyr)

# day label without the time-of-day part
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))

newdf <- lapply(unique(Measurings$Time2), function(x) {

  rr <- Measurings %>% filter(Time2 == x) # all rows for date x

  # keep the day only if heat30 contains "heat" at least once;
  # NULL entries are dropped by bind_rows()
  if (any(rr$heat30 == "heat")) rr else NULL

}) %>% bind_rows()
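The same "keep the whole day if any row qualifies" logic can also be sketched without an explicit loop, using a grouped filter. This is an alternative suggestion rather than what either answer proposes; the `toy` data frame and its values are made up for illustration, and the `Day` column name is an assumption.

```r
library(dplyr)

# Small hand-made data set (hypothetical values) covering two days
toy <- data.frame(
  Time = as.POSIXct(c("2018-05-18 00:00", "2018-05-18 12:00",
                      "2018-05-19 00:00", "2018-05-19 12:00"), tz = "CET"),
  Temp = c(25, 32, 24, 28)
)
toy$heat30 <- ifelse(toy$Temp > 30, "heat", "normal")

# Group rows by calendar day, then keep every row of a group
# in which at least one row is marked "heat"
hot_days <- toy %>%
  group_by(Day = as.Date(Time, tz = "CET")) %>%
  filter(any(heat30 == "heat")) %>%
  ungroup()

hot_days
```

Here both rows of 2018-05-18 are kept, including the 25 °C reading, while 2018-05-19 is dropped entirely. Passing `tz` to `as.Date()` matters because the default conversion uses UTC, which could assign readings near midnight to the wrong day.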

Tags: datetime, r, time-series, posixct, dplyr