ddply: summarize data hourly
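I want to summarize a data set by hourly and two-hourly frequency. The time column is in hh:mm:ss format.
The code below summarizes the data by month, but I have not found anything similar for hourly or two-hourly summaries.
Thanks in advance.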
library(plyr)
data2$StartDate <- as.Date(data2$StartDate, format = "%m/%d/%Y")
data4 <- ddply(data2, .(format(StartDate, "%m")), summarize, freq = length(StartDate))
The data set looks like this:
    TripId  StartDate    StartTime
     <int>     <date>  <S3: times>
1 15335543 2016-01-01     00:14:00
2 15335544 2016-01-01     00:14:00
3 15335607 2016-01-01     02:00:00
4 15335608 2016-01-01     02:01:00
5 15335613 2016-01-01     02:16:00
6 15335639 2016-01-01     02:50:00
If I understand the question correctly:
Hourly frequency:
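library(dplyr)

df %>%
  # Truncate each start time to the hour: the format stops at %H,
  # so minutes and seconds are ignored when parsing
  mutate(start_timestamp = as.POSIXct(paste(StartDate, StartTime),
                                      tz = "UTC", format = "%Y-%m-%d %H")) %>%
  # Join against every hour of every observed date so that empty hours are kept
  right_join(data.frame(seq_h = as.POSIXct(unlist(lapply(unique(df$StartDate),
      function(x) seq(from = as.POSIXct(paste(x, "00:00:00"), tz = "UTC"),
                      to   = as.POSIXct(paste(x, "23:00:00"), tz = "UTC"),
                      by   = "hour"))), origin = "1970-01-01", tz = "UTC")),
    by = c("start_timestamp" = "seq_h")) %>%
  group_by(start_timestamp) %>%
  # Rows that come only from the hour grid have an NA TripId and count as 0
  summarise(freq = sum(!is.na(TripId)))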
The output is:
      start_timestamp freq
1 2016-01-01 00:00:00    2
2 2016-01-01 01:00:00    1
3 2016-01-01 02:00:00    1
4 2016-01-01 03:00:00    0
5 2016-01-01 04:00:00    0
...
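Since the question asked about ddply specifically, the same hourly count can also be sketched with plyr, along the lines of the monthly code in the question. This is only a minimal sketch, assuming the sample data from this answer; the helper column `hour` and the result name `hourly` are introduced here for illustration. Unlike the join above, it returns only the hours that actually contain trips.

```r
# Sample data from this answer, rebuilt so the block runs on its own
df <- data.frame(
  TripId    = c(15335543L, 15335544L, 15335607L, 15335608L, 15335613L, 15335639L),
  StartDate = c("2016-01-01", "2016-01-01", "2016-01-01",
                "2016-01-01", "2016-01-02", "2016-01-02"),
  StartTime = c("00:14:00", "00:14:00", "01:00:00",
                "02:01:00", "02:16:00", "02:50:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: each timestamp labelled by the hour it falls in
ts <- as.POSIXct(paste(df$StartDate, df$StartTime), tz = "UTC")
df$hour <- format(ts, "%Y-%m-%d %H:00")

# plyr is called through its namespace so it does not mask dplyr's verbs
hourly <- plyr::ddply(df, "hour", plyr::summarize, freq = length(TripId))
hourly
```

Calling plyr through `plyr::` rather than `library(plyr)` is a deliberate choice here: attaching plyr after dplyr masks functions such as `summarise` and `mutate`, which can break the dplyr pipelines in this answer.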
For the two-hourly frequency:
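library(dplyr)

df %>%
  # cut() bins the timestamps into 2-hour intervals and labels each bin by its start time
  mutate(start_timestamp = as.POSIXct(cut(as.POSIXct(paste(StartDate, StartTime), tz = "UTC"),
                                          breaks = "2 hours"), tz = "UTC")) %>%
  # Join against every second hour of every observed date so that empty bins are kept
  right_join(data.frame(seq_h = as.POSIXct(unlist(lapply(unique(df$StartDate),
      function(x) seq(from = as.POSIXct(paste(x, "00:00:00"), tz = "UTC"),
                      to   = as.POSIXct(paste(x, "23:00:00"), tz = "UTC"),
                      by   = "2 hours"))), origin = "1970-01-01", tz = "UTC")),
    by = c("start_timestamp" = "seq_h")) %>%
  group_by(start_timestamp) %>%
  # Rows that come only from the 2-hour grid have an NA TripId and count as 0
  summarise(freq = sum(!is.na(TripId)))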
The output is:
      start_timestamp freq
1 2016-01-01 00:00:00    3
2 2016-01-01 02:00:00    1
3 2016-01-01 04:00:00    0
4 2016-01-01 06:00:00    0
5 2016-01-01 08:00:00    0
...
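One caveat with `cut(..., breaks = "2 hours")`: as far as I can tell, it starts its bins at the hour of the earliest timestamp, so if the first trip fell in an odd hour the bins would no longer line up with the 00:00, 02:00, ... grid used in the join. Below is a minimal sketch of an alternative that floors each timestamp to an even-hour boundary with plain arithmetic. It assumes UTC timestamps, and `floor_to_bin`, `bin_secs`, and the three example timestamps are made up here for illustration.

```r
# Floor POSIXct timestamps to the start of their fixed-width bin.
# Bins are counted from the epoch, so in UTC a 2-hour bin always starts on an even hour.
floor_to_bin <- function(x, bin_secs = 2 * 3600) {
  as.POSIXct(floor(as.numeric(x) / bin_secs) * bin_secs,
             origin = "1970-01-01", tz = "UTC")
}

# Illustrative timestamps whose earliest value falls in an odd hour
ts <- as.POSIXct(c("2016-01-01 01:14:00", "2016-01-01 02:01:00",
                   "2016-01-01 03:59:59"), tz = "UTC")
bins <- floor_to_bin(ts)

# Count trips per bin with base R; only non-empty bins appear
counts <- as.data.frame(
  table(start_timestamp = format(bins, "%Y-%m-%d %H:%M:%S")),
  responseName = "freq"
)
counts
```

Used inside the pipeline, `floor_to_bin(as.POSIXct(paste(StartDate, StartTime), tz = "UTC"))` could stand in for the `cut()` call in `mutate()`, leaving the join and the summary unchanged.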
Sample data:
df <- structure(list(
    TripId = c(15335543L, 15335544L, 15335607L, 15335608L, 15335613L, 15335639L),
    StartDate = c("2016-01-01", "2016-01-01", "2016-01-01",
                  "2016-01-01", "2016-01-02", "2016-01-02"),
    StartTime = c("00:14:00", "00:14:00", "01:00:00",
                  "02:01:00", "02:16:00", "02:50:00")),
  .Names = c("TripId", "StartDate", "StartTime"),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6"))