Time Series data aggregation and NA handling using R
I have time series data in the following format:
                       Ask    Bid  Trade Ask_Size Bid_Size Trade_Size
2016-11-01 01:00:03     NA 938.10     NA       NA      203         NA
2016-11-01 01:00:04     NA 937.20     NA       NA      100         NA
2016-11-01 01:00:04 938.00     NA     NA       28       NA         NA
2016-11-01 01:00:04     NA 938.10     NA       NA      203         NA
2016-11-01 01:00:04 939.00     NA     NA       11       NA         NA
2016-11-01 01:00:05     NA 938.15     NA       NA       19         NA
2016-11-01 01:00:06     NA 937.20     NA       NA      100         NA
2016-11-01 01:00:06 938.00     NA     NA       28       NA         NA
2016-11-01 01:00:06     NA     NA 938.10       NA       NA         69
2016-11-01 01:00:06     NA     NA 938.10       NA       NA        831
2016-11-01 01:00:06     NA 938.10     NA       NA      134         NA
The structure of the time series data is:
str(df_ts)
An ‘xts’ object on 2016-11-01 01:00:03/2016-11-02 12:59:37 containing:
Data: num [1:35797, 1:6] NA NA 938 NA 939 NA NA 938 NA NA ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:6] "Ask" "Bid" "Trade" "Ask_Size" ...
Indexed by objects of class: [POSIXct,POSIXt] TZ:
xts Attributes:
NULL
I tried to aggregate the data into 1-minute intervals with the following code:
# Creating a Function
apply.periodly <- function(x, FUN, period, k = 1, ...) {
  if (!require("xts")) {
    stop("Need 'xts'")
  }
  ep <- endpoints(x, on = period, k = k)
  period.apply(x, ep, FUN, ...)
}

# Aggregation every minute
df_aggregate_min <- apply.periodly(x = df_ts, FUN = mean, period = "minutes", k = 1)
However, because of the NA values in the data, I get incorrect output.
How can I aggregate each column per minute while ignoring the NAs?
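A likely reason for the incorrect output, shown as a minimal base R sketch: `mean` collapses a whole block of rows and columns into a single number, and returns NA as soon as any cell is NA. The `block` matrix below is hypothetical and only mimics one minute of the data above, where each row fills just one column.

```r
# One "minute" of hypothetical rows: each row fills only one of the two columns
block <- matrix(c(NA, 938.0, NA,      # Ask column
                  938.1, NA, 937.2),  # Bid column
                ncol = 2,
                dimnames = list(NULL, c("Ask", "Bid")))

# A single scalar for the whole block; any NA makes the result NA
whole_block <- mean(block)

# Still a single scalar: NAs are skipped, but Ask and Bid are mixed together
mixed <- mean(block, na.rm = TRUE)

# One mean per column, with NAs skipped
per_column <- colMeans(block, na.rm = TRUE)

print(whole_block)
print(mixed)
print(per_column)
```

A per-column function with `na.rm = TRUE` therefore looks like the piece missing from the original call.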
Here is an approach for two individual columns:
library(readr)
library(xts)
library(lubridate)

Sys.setenv(TZ = "UTC")

# hack: in-place edit of the input file Sample_HFT.csv
# replace the first comma on each line with "T" to create ISO datetime strings
# do this only ONCE!
system('perl -pi -E "s/,/T/" Sample_HFT.csv')

hft <- read_csv("Sample_HFT.csv", col_names = TRUE)
head(hft)

# build the xts object from every column except the first, indexed by the parsed datetimes
hft.xts <- as.xts(hft[, -1], order.by = ymd_hms(hft$T))
indexFormat(hft.xts) <- "%y-%m-%d %H:%M:%S"

# cumulative mean of the non-NA values, keeping only the last value in each minute
my.cummean <- function(x) {
  x_obs <- x[!is.na(x)]
  cummeans <- cumsum(x_obs) / seq_along(x_obs)
  cummeans[endpoints(cummeans, "minutes"), ]
}

# split each column into per-minute chunks, reduce each chunk, then bind the results
ask_minutes <- split(hft.xts$Ask, f = "minutes")
ask_minutes_cum <- lapply(ask_minutes, my.cummean)
ask_minutes_mean <- do.call("rbind", ask_minutes_cum)

trade_size_minutes <- split(hft.xts$Trade_Size, f = "minutes")
trade_size_minutes_cum <- lapply(trade_size_minutes, my.cummean)
trade_size_minutes_mean <- do.call("rbind", trade_size_minutes_cum)
I still don't know whether this is the business logic you want, but I think you can work out the details.
head(trade_size_minutes_mean)
Trade_Size
16-11-01 01:00:35 194.500
16-11-01 01:01:59 59.909
16-11-01 01:02:48 5.875
16-11-01 01:03:34 6.000
16-11-01 01:08:57 3.889
16-11-01 01:09:29 1.682
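The last cumulative mean within a minute equals the plain mean of that minute's non-NA values, so the same per-minute figures can probably be obtained for all columns at once. The following is a minimal sketch, not the answerer's method: it passes `colMeans` with `na.rm = TRUE` to `period.apply`. The `toy` object and its values are hypothetical stand-ins for `df_ts`.

```r
library(xts)

# Hypothetical toy data shaped like the question's: each row fills only some columns
idx <- as.POSIXct(c("2016-11-01 01:00:03", "2016-11-01 01:00:04",
                    "2016-11-01 01:00:06", "2016-11-01 01:01:10",
                    "2016-11-01 01:01:30"), tz = "UTC")
vals <- cbind(Ask = c(NA, 938, NA, 940, NA),
              Bid = c(938.1, NA, 937.2, NA, 939.0))
toy <- xts(vals, order.by = idx)

# Row positions of the last observation in each minute (with a leading 0)
ep <- endpoints(toy, on = "minutes")

# colMeans yields one value per column for each minute; na.rm = TRUE skips the NAs
per_minute <- period.apply(toy, ep, colMeans, na.rm = TRUE)
print(per_minute)
```

Each output row is stamped with the last observation time in its minute, as in the output above. A column with no observations in a given minute comes back as NaN rather than NA, which may need separate handling. For size columns such as Trade_Size a per-minute sum may be more meaningful than a mean, but that depends on the intended business logic.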