How to subset time data (factor) into hourly intervals

Right now I have a data.frame (dim(1:1080)) with the variables date, time, and glob.rad.

      date      time         glob.rad
1   2014/07/19  00:00:00     -1.6
2   2014/07/19  00:02:00     -1.6
3   2014/07/19  00:03:00     -1.6
4   2014/07/19  00:04:00     -1.6
5   2014/07/19  00:06:00     -1.6
6   2014/07/19  00:07:00     -1.6
7   2014/07/19  00:08:00     -1.6
8   2014/07/19  00:10:00     -1.6
9   2014/07/19  00:11:00     -1.6
10  2014/07/19  00:12:00     -1.6
11  2014/07/19  00:14:00     -1.6
12  2014/07/19  00:15:00     -1.6
13  2014/07/19  00:16:00     -1.6
14  2014/07/19  00:18:00     -1.5
15  2014/07/19  00:19:00     -1.5
16  2014/07/19  00:20:00     -1.4
17  2014/07/19  00:22:00     -1.4
18  2014/07/19  00:23:00     -1.3
19  2014/07/19  00:24:00     -1.3
20  2014/07/19  00:26:00     -1.3
21  2014/07/19  00:27:00     -1.3
22  2014/07/19  00:28:00     -1.3
23  2014/07/19  00:30:00     -1.3
24  2014/07/19  00:31:00     -1.4
25  2014/07/19  00:32:00     -1.4
26  2014/07/19  00:34:00     -1.5
27  2014/07/19  00:35:00     -1.5
28  2014/07/19  00:36:00     -1.6
29  2014/07/19  00:38:00     -1.6
30  2014/07/19  00:39:00     -1.6
31  2014/07/19  00:40:00     -1.6
32  2014/07/19  00:42:00     -1.6
33  2014/07/19  00:43:00     -1.6
34  2014/07/19  00:44:00     -1.6
35  2014/07/19  00:46:00     -1.6
36  2014/07/19  00:47:00     -1.6
37  2014/07/19  00:48:00     -1.6
38  2014/07/19  00:50:00     -1.6
39  2014/07/19  00:51:00     -1.6
40  2014/07/19  00:52:00     -1.6
41  2014/07/19  00:54:00     -1.6
42  2014/07/19  00:55:00     -1.6
43  2014/07/19  00:56:00     -1.6
44  2014/07/19  00:58:00     -1.6
45  2014/07/19  00:59:00     -1.6
46  2014/07/19  01:00:00     -1.6
47  2014/07/19  01:02:00     -1.6
48  2014/07/19  01:03:00     -1.6
49  2014/07/19  01:04:00     -1.6
50  2014/07/19  01:06:00     -1.6
... 

All variables are factors. The goal is to subset the variable "time" into hourly intervals so that I can calculate the mean of "glob.rad" within each hour.

    date        time         glob.rad
1   2014/07/19  00:00:00     -1.6
2   2014/07/19  01:00:00     -1.6
3   2014/07/19  02:00:00     -1.6
...

While I know how to handle POSIXct data as date-time, I don't know how to handle time stored as a factor. So far I have tried cut(), subset(), and as.numeric(), but none of them worked.

I like the semantics of dplyr with the pipe (%>%). It reads much like a sentence.

tab <- structure(list(date = c("2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19", 
"2014/07/19", "2014/07/19"), time = c("00:00:00", "00:02:00", 
"00:03:00", "00:04:00", "00:06:00", "00:07:00", "00:08:00", "00:10:00", 
"00:11:00", "00:12:00", "00:14:00", "00:15:00", "00:16:00", "00:18:00", 
"00:19:00", "00:20:00", "00:22:00", "00:23:00", "00:24:00", "00:26:00", 
"00:27:00", "00:28:00", "00:30:00", "00:31:00", "00:32:00", "00:34:00", 
"00:35:00", "00:36:00", "00:38:00", "00:39:00", "00:40:00", "00:42:00", 
"00:43:00", "00:44:00", "00:46:00", "00:47:00", "00:48:00", "00:50:00", 
"00:51:00", "00:52:00", "00:54:00", "00:55:00", "00:56:00", "00:58:00", 
"00:59:00", "01:00:00", "01:02:00", "01:03:00", "01:04:00", "01:06:00"
), glob.rad = c(-1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, 
-1.6, -1.6, -1.6, -1.6, -1.6, -1.5, -1.5, -1.4, -1.4, -1.3, -1.3, 
-1.3, -1.3, -1.3, -1.3, -1.4, -1.4, -1.5, -1.5, -1.6, -1.6, -1.6, 
-1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, 
-1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6, -1.6)), .Names = c("date", 
"time", "glob.rad"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", 
"36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", 
"47", "48", "49", "50"))


#> head(tab)
#        date     time glob.rad
#1 2014/07/19 00:00:00     -1.6
#2 2014/07/19 00:02:00     -1.6
#3 2014/07/19 00:03:00     -1.6
#4 2014/07/19 00:04:00     -1.6
#5 2014/07/19 00:06:00     -1.6
#6 2014/07/19 00:07:00     -1.6

library(lubridate)
library(dplyr)

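# Combine date and time into a single date-time column, then extract the hour
tab$date <- ymd_hms(paste(tab$date, tab$time))
tab$hour <- hour(tab$date)
#head(tab)
# Average glob.rad within each hour
tab %>%
  group_by(hour) %>%
  summarise(avg = mean(glob.rad, na.rm = TRUE))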

#Source: local data frame [2 x 2]
#
#  hour       avg
#1    0 -1.533333
#2    1 -1.600000

If you want to summarise glob.rad by day and hour, the simplest option is to create a new variable that extracts the day from the date column.

tab$day <- day(tab$date)

and add it to your grouping

tab %>%
  group_by(day, hour) %>%
  summarise(avg = mean(glob.rad, na.rm = TRUE))

#Source: local data frame [2 x 3]
#Groups: day
#
#  day hour       avg
#1  19    0 -1.533333
#2  19    1 -1.600000
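
The desired output in the question shows one row per hour, labelled with the time at which the hour starts. A minimal sketch of one way to get that shape, assuming lubridate's floor_date() is acceptable here. The data frame tab_small and the columns datetime and hour_start are made-up names for illustration, and the three values are a toy sample, not the original data:

```r
library(lubridate)
library(dplyr)

# Toy sample with factor columns (hypothetical data, for illustration only)
tab_small <- data.frame(
  date = c("2014/07/19", "2014/07/19", "2014/07/19"),
  time = c("00:00:00", "00:30:00", "01:02:00"),
  glob.rad = c(-1.6, -1.4, -1.6),
  stringsAsFactors = TRUE
)

hourly <- tab_small %>%
  # paste() turns the factors into their labels before parsing
  mutate(datetime = ymd_hms(paste(date, time)),
         # round each timestamp down to the start of its hour
         hour_start = floor_date(datetime, unit = "hour")) %>%
  group_by(hour_start) %>%
  summarise(avg = mean(glob.rad, na.rm = TRUE))

hourly
```

Grouping on the floored timestamp keeps day and hour in a single column, so no separate day variable is needed when the data span several days.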

sessionInfo()
#R version 3.2.2 (2015-08-14)
#...
#other attached packages:
#[1] lubridate_1.3.3 dplyr_0.4.2

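You don't need to deal with the time as a factor. You could, but pasting the date and time columns together and using the result for grouping makes life easier. The data.table package makes this very easy, since it has functions for extracting the individual parts of POSIX/Date objects. We can use those parts for our grouping.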

library(data.table)
setDT(df)[, .(mean = mean(glob.rad)), by = hour(paste(date, time))]
#    hour      mean
# 1:    0 -1.533333
# 2:    1 -1.600000

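The original data is unchanged, except that it has been converted to a data.table. If you want both the date and the hour in the result, you can do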

df[, .(mean = mean(glob.rad)), by = .(date, hour(paste(date, time)))]
#          date hour      mean
# 1: 2014/07/19    0 -1.533333
# 2: 2014/07/19    1 -1.600000

The last block does use a factor in the date column, since I found no need to change it to a Date-class column.
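
Since the question mentions trying cut() and as.numeric() on factor columns, here is a minimal base R sketch of how those two could be made to work, under the assumption that all three columns really are factors. The data frame df_toy and the names stamp and hour_start are hypothetical, and the values are a toy sample:

```r
# Toy data where every column is a factor, as described in the question
# (hypothetical values, for illustration only)
df_toy <- data.frame(
  date = factor(c("2014/07/19", "2014/07/19", "2014/07/19", "2014/07/19")),
  time = factor(c("00:00:00", "00:30:00", "01:00:00", "01:02:00")),
  glob.rad = factor(c("-1.6", "-1.4", "-1.6", "-1.6"))
)

# as.numeric() on a factor returns the internal level codes, not the values,
# so go through as.character() first
df_toy$glob.rad <- as.numeric(as.character(df_toy$glob.rad))

# paste() uses the factor labels, giving a character vector that can be parsed
stamp <- as.POSIXct(paste(df_toy$date, df_toy$time),
                    format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# cut() on a POSIXct vector with breaks = "hour" bins each timestamp by hour
df_toy$hour_start <- cut(stamp, breaks = "hour")

# Mean of glob.rad within each hourly bin
hourly <- aggregate(glob.rad ~ hour_start, data = df_toy, FUN = mean)
hourly
```

The key point is that cut() needs a date-time object rather than a factor of time strings, and numeric factors need the as.character() detour before as.numeric().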