Closest Value for each 15 min interval
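I am looking for the closest reading to each 15-minute interval (i.e. 12:00:00 AM, 12:15:00 AM, 12:30:00 AM), given any number of readings between the intervals.
For example, I would like to take df: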
Timestamp Value (kW)
8/12/2018 23:00:06 51
8/13/2018 0:00:16 52
8/13/2018 0:10:26 53
8/13/2018 0:14:36 54
8/13/2018 0:15:00 55
8/13/2018 0:19:57 56
8/13/2018 0:29:09 57
8/13/2018 0:38:17 58
8/13/2018 0:44:59 59
8/13/2018 0:45:00 60
8/13/2018 0:58:47 61
8/13/2018 1:01:57 62
structure(list(Timestamp = c("8/12/2018 23:00:00", "8/13/2018 0:00:00",
"8/13/2018 0:10:00", "8/13/2018 0:14:00", "8/13/2018 0:15:00",
"8/13/2018 0:19:00", "8/13/2018 0:29:00", "8/13/2018 0:38:00",
"8/13/2018 0:44:00", "8/13/2018 0:45:00", "8/13/2018 0:58:00",
"8/13/2018 1:01:00"), Value..kW. = 51:62), .Names = c("Timestamp",
"Value..kW."), class = "data.frame", row.names = c(NA, -12L))
and get something closer to df2:
Interval Value
8/13/2018 0:00:00 51
8/13/2018 0:15:00 55
8/13/2018 0:30:00 57
8/13/2018 0:45:00 60
8/13/2018 1:00:00 61
Please note the seconds as well.
I was thinking that zoo and its na.locf function, together with dplyr or data.table, could get me halfway there. I am open to other packages.
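This may differ slightly from your sample result. I am not sure your sample output is 100% correct. For example, what about the reading from 8/12/2018?
The lubridate library has a number of useful date-time functions. The code below converts the character timestamps to date-times and rounds them to the nearest period. (There are also floor_date and ceiling_date functions, which round down and up respectively.)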
library(dplyr)
library(lubridate)
df %>%
  # ensure timestamp is a date-time type (use mdy_hms() if the strings include seconds)
  # and round to the nearest fifteen minutes
  mutate(ts = mdy_hm(Timestamp),
         period = round_date(ts, unit = "15 minutes")) %>%
  # group into periods
  group_by(period) %>%
  # grab the first row in each period, ordered by the timestamp (use 1 for last)
  top_n(-1, ts) %>%
  # order the result
  arrange(period)
# Timestamp Value..kW. ts period
# <chr> <int> <dttm> <dttm>
# 1 8/12/2018 23:00 51 2018-08-12 23:00:00 2018-08-12 23:00:00
# 2 8/13/2018 0:00 52 2018-08-13 00:00:00 2018-08-13 00:00:00
# 3 8/13/2018 0:10 53 2018-08-13 00:10:00 2018-08-13 00:15:00
# 4 8/13/2018 0:29 57 2018-08-13 00:29:00 2018-08-13 00:30:00
# 5 8/13/2018 0:38 58 2018-08-13 00:38:00 2018-08-13 00:45:00
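As a rough cross-check on what round_date, floor_date and ceiling_date do to a single timestamp, the same snapping can be sketched in base R with arithmetic on epoch seconds. This is only an illustration and not how lubridate is implemented. The helper name snap_15 is made up here, and the sketch assumes UTC timestamps so that 15-minute boundaries fall on multiples of 900 seconds.

```r
# Base-R sketch (no lubridate): snap a timestamp to a 15-minute grid.
# snap_15 is a hypothetical helper name used only for this illustration.
snap_15 <- function(x, how = c("nearest", "down", "up")) {
  how <- match.arg(how)
  step <- 15 * 60                       # 15 minutes in seconds
  q <- as.numeric(x) / step             # position measured in 15-minute units
  k <- switch(how,
              nearest = floor(q + 0.5), # exact halfway cases go up
              down    = floor(q),
              up      = ceiling(q))
  as.POSIXct(k * step, origin = "1970-01-01", tz = "UTC")
}

ts <- as.POSIXct("2018-08-13 00:19:57", tz = "UTC")
snapped <- format(c(snap_15(ts, "down"), snap_15(ts, "nearest"), snap_15(ts, "up")),
                  "%H:%M:%S", tz = "UTC")
print(snapped)
# [1] "00:15:00" "00:15:00" "00:30:00"
```

Rounding to the nearest slot sends the 0:19:57 reading to 00:15, while rounding up sends it to 00:30, so the choice of function decides which readings compete for each interval.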
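This could be a good application for a data.table rolling join using the "nearest" option.
The first step is to get the data into a data.table object with a correctly formatted POSIXct timestamp.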
library(data.table)
DT <- structure(list(Timestamp = c("8/12/2018 23:00:00", "8/13/2018 0:00:00",
"8/13/2018 0:10:00", "8/13/2018 0:14:00", "8/13/2018 0:15:00",
"8/13/2018 0:19:00", "8/13/2018 0:29:00", "8/13/2018 0:38:00",
"8/13/2018 0:44:00", "8/13/2018 0:45:00", "8/13/2018 0:58:00",
"8/13/2018 1:01:00"), Value..kW. = 51:62), .Names = c("Timestamp",
"Value..kW."), class = "data.frame", row.names = c(NA, -12L))
## Convert from data.frame to data.table
setDT(DT)
## Convert to POSIXct
DT[,Timestamp := as.POSIXct(Timestamp, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")]
Once that is done, you can generate another table containing a sequence of 15-minute intervals.
## Get Start and Ends
Start <- min(as.POSIXct(cut.POSIXt(DT[,Timestamp],breaks = c("15 min")), tz = "UTC"))
End <- max(as.POSIXct(cut.POSIXt(DT[,Timestamp],breaks = c("15 min")), tz = "UTC"))
## Generate data.table with a sequence
SummaryDT <- data.table(TimeStamp15 = seq.POSIXt(from = Start, to = End, by = "15 min"))
print(SummaryDT)
# TimeStamp15
# 1: 2018-08-12 23:00:00
# 2: 2018-08-12 23:15:00
# 3: 2018-08-12 23:30:00
# 4: 2018-08-12 23:45:00
# 5: 2018-08-13 00:00:00
# 6: 2018-08-13 00:15:00
# 7: 2018-08-13 00:30:00
# 8: 2018-08-13 00:45:00
# 9: 2018-08-13 01:00:00
Then you can set keys and use a rolling-join update to get the value closest to each 15-minute time.
## Set keys
setkey(SummaryDT,TimeStamp15)
setkey(DT,Timestamp)
## Create a new column in SummaryDT with the closest measurement
SummaryDT[DT, Closest_Value_kW := `i.Value..kW.` , roll = "nearest"]
print(SummaryDT)
# TimeStamp15 Closest_Value_kW
# 1: 2018-08-12 23:00:00 51
# 2: 2018-08-12 23:15:00 NA
# 3: 2018-08-12 23:30:00 NA
# 4: 2018-08-12 23:45:00 NA
# 5: 2018-08-13 00:00:00 52
# 6: 2018-08-13 00:15:00 56
# 7: 2018-08-13 00:30:00 57
# 8: 2018-08-13 00:45:00 60
# 9: 2018-08-13 01:00:00 62
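One thing worth checking in the output above: the update join looks up each reading in SummaryDT, so a slot ends up with the last reading that was matched to it (00:15 shows 56, from the 0:19 reading, rather than 55), and slots that no reading maps to stay NA. If the goal is instead the reading closest to each slot, the lookup can be done per slot. The following is a minimal base-R sketch of that idea with no data.table involved; the variable names are illustrative, and ties go to the earlier reading.

```r
# Base-R sketch: for every 15-minute slot, pick the reading closest in time.
# Variable names are illustrative; the data are the question's dput() sample.
ts <- as.POSIXct(
  c("8/12/2018 23:00:00", "8/13/2018 0:00:00", "8/13/2018 0:10:00",
    "8/13/2018 0:14:00", "8/13/2018 0:15:00", "8/13/2018 0:19:00",
    "8/13/2018 0:29:00", "8/13/2018 0:38:00", "8/13/2018 0:44:00",
    "8/13/2018 0:45:00", "8/13/2018 0:58:00", "8/13/2018 1:01:00"),
  format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
value <- 51:62

grid <- seq(as.POSIXct("2018-08-12 23:00:00", tz = "UTC"),
            as.POSIXct("2018-08-13 01:00:00", tz = "UTC"),
            by = "15 min")

# Index of the reading with the smallest absolute time difference to each slot;
# which.min() returns the first minimum, so ties go to the earlier reading.
idx <- vapply(as.numeric(grid),
              function(g) which.min(abs(as.numeric(ts) - g)),
              integer(1))

closest <- data.frame(Interval = grid, Value = value[idx])
print(closest)
```

On this sample the 00:15 slot gets 55 and the 01:00 slot gets 62, because the 1:01 reading is one minute away and the 0:58 reading is two.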
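If you are new to data.table this may be a bit hard to digest, as this example is on the advanced side. The Getting Started page on its website may be a good place to begin if you have never used data.table before.
Running help("data.table") will give you a concise description, but two blog posts may help you get a better understanding of what rolling joins can do: "R – Data.Table Rolling Joins" by Ben Gorman on his blog Gorman Analysis, and "Understanding data.table Rolling Joins" by Robert Norberg on his blog bRogramming.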
Update: It looks like you might want to only carry observations forward rather than take the "closest" value. In that case, one option is as follows (using the same DT as the starting point):
## Get Start and Ends
Start <- min(as.POSIXct(cut.POSIXt(DT[,Timestamp],breaks = c("15 min")), tz = "UTC"))
End <- max(as.POSIXct(cut.POSIXt(DT[,Timestamp],breaks = c("15 min")), tz = "UTC"))
## Generate data.table with a sequence
SummaryDT <- data.table(TimeStamp15 = seq.POSIXt(from = Start, to = End, by = "15 min"))
## Set keys
setkey(SummaryDT,TimeStamp15)
setkey(DT,Timestamp)
## Do a rolling join
FinalDT <- DT[SummaryDT, roll = +Inf]
print(FinalDT)
# Timestamp Value..kW.
# 1: 2018-08-12 23:00:00 51
# 2: 2018-08-12 23:15:00 51
# 3: 2018-08-12 23:30:00 51
# 4: 2018-08-12 23:45:00 51
# 5: 2018-08-13 00:00:00 52
# 6: 2018-08-13 00:15:00 55
# 7: 2018-08-13 00:30:00 57
# 8: 2018-08-13 00:45:00 60
# 9: 2018-08-13 01:00:00 61
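The carry-forward behaviour of roll = +Inf can also be reproduced without data.table, which may help to see what the join is doing. The following is a base-R sketch under the assumption that the readings are sorted by time; it uses findInterval() and illustrative variable names, and it is not the answer's method.

```r
# Base-R sketch of "last observation carried forward" onto a 15-minute grid.
# Variable names are illustrative; the data are the question's dput() sample.
ts <- as.POSIXct(
  c("8/12/2018 23:00:00", "8/13/2018 0:00:00", "8/13/2018 0:10:00",
    "8/13/2018 0:14:00", "8/13/2018 0:15:00", "8/13/2018 0:19:00",
    "8/13/2018 0:29:00", "8/13/2018 0:38:00", "8/13/2018 0:44:00",
    "8/13/2018 0:45:00", "8/13/2018 0:58:00", "8/13/2018 1:01:00"),
  format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
value <- 51:62

grid <- seq(as.POSIXct("2018-08-12 23:00:00", tz = "UTC"),
            as.POSIXct("2018-08-13 01:00:00", tz = "UTC"),
            by = "15 min")

# findInterval() gives, for each slot, the index of the last reading at or
# before it (0 if there is none); ts must be sorted in increasing order.
idx <- findInterval(as.numeric(grid), as.numeric(ts))
idx[idx == 0] <- NA                    # no earlier reading: leave the slot empty

locf <- data.frame(Interval = grid, Value = value[idx])
print(locf)
```

The Value column comes out as 51, 51, 51, 51, 52, 55, 57, 60, 61, the same values as in FinalDT above.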
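Depending on the structure of the input data and the expected result, the OP has several options.
From the question and the sample dataset it is not entirely clear what the expected result should look like if the input data contain gaps, i.e. periods longer than 15 minutes during which no data were recorded. How does the OP want gaps in the input data to be reflected in the result?
Edit: The OP has supplied two slightly different datasets. Both are used below to demonstrate the effect of the input data on the result.
The variants below use lubridate and data.table. df is assumed to be already ordered by Timestamp.
Preparation
Required for all variants: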
library(lubridate)
library(data.table)
setDT(df)[, Timestamp := mdy_hms(Timestamp)]
Aggregate to the next 15-minute interval (gaps in the result)
The simplest solution is to aggregate to the next 15-minute interval:
df[, .SD[.N], by = .(Interval = ceiling_date(Timestamp, "15 min"))]
Interval Value..kW.
1: 2018-08-12 23:00:00 51
2: 2018-08-13 00:00:00 52
3: 2018-08-13 00:15:00 55
4: 2018-08-13 00:30:00 57
5: 2018-08-13 00:45:00 60
6: 2018-08-13 01:00:00 61
7: 2018-08-13 01:15:00 62
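Note that there is a gap of 1 hour between rows 1 and 2, where 3 intervals are missing.
For the sake of completeness, here is a variant which also works with unordered data.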
df[, .SD[which.max(Timestamp)], keyby = .(Interval = ceiling_date(Timestamp, "15 min"))]
Edit: Using the other dataset (without truncated seconds) we get
df0[, .SD[.N], by = .(Interval = ceiling_date(Timestamp, "15 min"))]
Interval Value..kW.
1: 2018-08-12 23:15:00 51
2: 2018-08-13 00:15:00 55
3: 2018-08-13 00:30:00 57
4: 2018-08-13 00:45:00 60
5: 2018-08-13 01:00:00 61
6: 2018-08-13 01:15:00 62
Note that without truncated seconds the values move to the next interval.
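To see which readings are affected by the seconds, the two forms of the timestamps can be rounded up side by side. The following is a base-R sketch with arithmetic on epoch seconds instead of ceiling_date; the helper name ceil_15 is made up for this illustration and the timestamps are assumed to be UTC. A timestamp exactly on a boundary stays where it is, which matches the 23:00:00 row in the first result.

```r
# Base-R sketch: compare the interval each reading falls into,
# with seconds truncated versus with the seconds as printed in the question.
stamps_trunc <- c("8/12/2018 23:00:00", "8/13/2018 0:00:00", "8/13/2018 0:10:00",
                  "8/13/2018 0:14:00", "8/13/2018 0:15:00", "8/13/2018 0:19:00",
                  "8/13/2018 0:29:00", "8/13/2018 0:38:00", "8/13/2018 0:44:00",
                  "8/13/2018 0:45:00", "8/13/2018 0:58:00", "8/13/2018 1:01:00")
stamps_full  <- c("8/12/2018 23:00:06", "8/13/2018 0:00:16", "8/13/2018 0:10:26",
                  "8/13/2018 0:14:36", "8/13/2018 0:15:00", "8/13/2018 0:19:57",
                  "8/13/2018 0:29:09", "8/13/2018 0:38:17", "8/13/2018 0:44:59",
                  "8/13/2018 0:45:00", "8/13/2018 0:58:47", "8/13/2018 1:01:57")

# ceil_15 is a hypothetical helper: parse, then round up to a multiple of 900 s.
ceil_15 <- function(s) {
  x <- as.numeric(as.POSIXct(s, format = "%m/%d/%Y %H:%M:%S", tz = "UTC"))
  as.POSIXct(ceiling(x / 900) * 900, origin = "1970-01-01", tz = "UTC")
}

# TRUE where the extra seconds push a reading into a later interval
moved <- ceil_15(stamps_trunc) != ceil_15(stamps_full)
print(data.frame(
  reading            = stamps_full[moved],
  interval_truncated = format(ceil_15(stamps_trunc)[moved], "%H:%M", tz = "UTC"),
  interval_full      = format(ceil_15(stamps_full)[moved], "%H:%M", tz = "UTC")
))
```

Only the first two readings change interval (23:00 becomes 23:15 and 00:00 becomes 00:15); all other readings already sat strictly inside an interval.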
Aggregate to the next 15-minute interval, without gaps in the result
step <- "15 min"
df[, .SD[.N], by = .(Interval = ceiling_date(Timestamp, step))][
.(seq(min(Interval), max(Interval), step)), on = .(Interval = V1)]
Here we join with a sequence of timestamps to fill in the missing intervals:
Interval Value..kW.
1: 2018-08-12 23:00:00 51
2: 2018-08-12 23:15:00 NA
3: 2018-08-12 23:30:00 NA
4: 2018-08-12 23:45:00 NA
5: 2018-08-13 00:00:00 52
6: 2018-08-13 00:15:00 55
7: 2018-08-13 00:30:00 57
8: 2018-08-13 00:45:00 60
9: 2018-08-13 01:00:00 61
10: 2018-08-13 01:15:00 62
Now the gaps are visible in the result as NA values.
Edit: Using the other dataset (without truncated seconds) we get
df0[, .SD[.N], by = .(Interval = ceiling_date(Timestamp, step))][
.(seq(min(Interval), max(Interval), step)), on = .(Interval = V1)]
Interval Value..kW.
1: 2018-08-12 23:15:00 51
2: 2018-08-12 23:30:00 NA
3: 2018-08-12 23:45:00 NA
4: 2018-08-13 00:00:00 NA
5: 2018-08-13 00:15:00 55
6: 2018-08-13 00:30:00 57
7: 2018-08-13 00:45:00 60
8: 2018-08-13 01:00:00 61
9: 2018-08-13 01:15:00 62
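Rolling join (gaps in the result are filled with data)
This is a streamlined version of the rolling join approach shown above: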
step = "15 min"
df[.(seq(floor_date(min(Timestamp), step), ceiling_date(max(Timestamp), step),by = step)),
on = .(Timestamp = V1), roll = TRUE]
Timestamp Value..kW.
1: 2018-08-12 23:00:00 51
2: 2018-08-12 23:15:00 51
3: 2018-08-12 23:30:00 51
4: 2018-08-12 23:45:00 51
5: 2018-08-13 00:00:00 52
6: 2018-08-13 00:15:00 55
7: 2018-08-13 00:30:00 57
8: 2018-08-13 00:45:00 60
9: 2018-08-13 01:00:00 61
10: 2018-08-13 01:15:00 62
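Here, the gaps have been filled with data copied from the most recent available value. From the result, it can no longer be seen that there were gaps in the input data.
Edit: Using the other dataset (without truncated seconds) we get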
df0[.(seq(floor_date(min(Timestamp), step), ceiling_date(max(Timestamp), step),by = step)),
on = .(Timestamp = V1), roll = TRUE]
Timestamp Value..kW.
1: 2018-08-12 23:00:00 NA
2: 2018-08-12 23:15:00 51
3: 2018-08-12 23:30:00 51
4: 2018-08-12 23:45:00 51
5: 2018-08-13 00:00:00 51
6: 2018-08-13 00:15:00 55
7: 2018-08-13 00:30:00 57
8: 2018-08-13 00:45:00 60
9: 2018-08-13 01:00:00 61
10: 2018-08-13 01:15:00 62
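Here we have an unfilled gap in the first row. This is caused by the way the sequence of intervals is constructed. It can be avoided with a slight modification: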
df0[.(seq(ceiling_date(min(Timestamp), step), ceiling_date(max(Timestamp), step),by = step)),
on = .(Timestamp = V1), roll = TRUE]
Timestamp Value..kW.
1: 2018-08-12 23:15:00 51
2: 2018-08-12 23:30:00 51
3: 2018-08-12 23:45:00 51
4: 2018-08-13 00:00:00 51
5: 2018-08-13 00:15:00 55
6: 2018-08-13 00:30:00 57
7: 2018-08-13 00:45:00 60
8: 2018-08-13 01:00:00 61
9: 2018-08-13 01:15:00 62
Data
The data as provided by the OP via dput():
df <-
structure(list(Timestamp = c("8/12/2018 23:00:00", "8/13/2018 0:00:00",
"8/13/2018 0:10:00", "8/13/2018 0:14:00", "8/13/2018 0:15:00",
"8/13/2018 0:19:00", "8/13/2018 0:29:00", "8/13/2018 0:38:00",
"8/13/2018 0:44:00", "8/13/2018 0:45:00", "8/13/2018 0:58:00",
"8/13/2018 1:01:00"), Value..kW. = 51:62), .Names = c("Timestamp",
"Value..kW."), class = "data.frame", row.names = c(NA, -12L))
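Edit: The OP has supplied two slightly different datasets:
- as dput(), with the seconds truncated (df in this answer)
- as printed df in the question, without truncated seconds (df0 in this answer)
This subtle difference affects the result. Therefore, here is the printed dataset: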
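# Built column by column so that it does not rely on whitespace alignment;
# Value..kW. is the name data.frame() makes of the printed header "Value.(kW)"
df0 <- data.frame(
  Timestamp = c("8/12/2018 23:00:06", "8/13/2018 0:00:16", "8/13/2018 0:10:26",
                "8/13/2018 0:14:36", "8/13/2018 0:15:00", "8/13/2018 0:19:57",
                "8/13/2018 0:29:09", "8/13/2018 0:38:17", "8/13/2018 0:44:59",
                "8/13/2018 0:45:00", "8/13/2018 0:58:47", "8/13/2018 1:01:57"),
  Value..kW. = 51:62,
  stringsAsFactors = FALSE
)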
# prepare
library(lubridate)
library(data.table)
setDT(df0)[, Timestamp := mdy_hms(Timestamp)]