Handle Continuous Missing Values in Time-Series Data
I have time-series data that looks like this:
2015-04-26 23:00:00 5704.27388916015661380
2015-04-27 00:00:00 4470.30868326822928793
2015-04-27 01:00:00 4552.57241617838553793
2015-04-27 02:00:00 4570.22250032825650123
2015-04-27 03:00:00 NA
2015-04-27 04:00:00 NA
2015-04-27 05:00:00 NA
2015-04-27 06:00:00 12697.37724086216439900
2015-04-27 07:00:00 5538.71119009653739340
2015-04-27 08:00:00 81.95060647328695325
2015-04-27 09:00:00 8550.65816895300667966
2015-04-27 10:00:00 2925.76573206583680076
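How should I handle consecutive NA values? When there is only a single NA, I replace it with the average of the two values on either side of it. Is there a standard way to handle runs of consecutive missing values?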
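The zoo package has several functions for handling NA values. One of the following may fit your needs:
na.locf: last observation carried forward. Using the argument fromLast = TRUE gives next observation carried backward (NOCB) instead.
na.aggregate: replaces each NA with an aggregated value. The default aggregation function is mean, but you can specify another function. See ?na.aggregate for details.
na.approx: replaces each NA with a linearly interpolated value.
You can compare the results to see what these functions do: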
library(zoo)
df$V.loc <- na.locf(df$V2)
df$V.agg <- na.aggregate(df$V2)
df$V.app <- na.approx(df$V2)
This gives:
> df
                    V1          V2       V.loc       V.agg       V.app
1  2015-04-26 23:00:00  5704.27389  5704.27389  5704.27389  5704.27389
2  2015-04-27 00:00:00  4470.30868  4470.30868  4470.30868  4470.30868
3  2015-04-27 01:00:00  4552.57242  4552.57242  4552.57242  4552.57242
4  2015-04-27 02:00:00  4570.22250  4570.22250  4570.22250  4570.22250
5  2015-04-27 03:00:00          NA  4570.22250  5454.64894  6602.01119
6  2015-04-27 04:00:00          NA  4570.22250  5454.64894  8633.79987
7  2015-04-27 05:00:00          NA  4570.22250  5454.64894 10665.58856
8  2015-04-27 06:00:00 12697.37724 12697.37724 12697.37724 12697.37724
9  2015-04-27 07:00:00  5538.71119  5538.71119  5538.71119  5538.71119
10 2015-04-27 08:00:00    81.95061    81.95061    81.95061    81.95061
11 2015-04-27 09:00:00  8550.65817  8550.65817  8550.65817  8550.65817
12 2015-04-27 10:00:00  2925.76573  2925.76573  2925.76573  2925.76573
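The V.app column can be reproduced by hand, which makes it clearer what linear interpolation does across a run of NAs. The following is a minimal base R sketch, not zoo's actual implementation. The variable names (v, runs, gap_lengths, filled) are illustrative, and it assumes equally spaced observations so that positions can stand in for the timestamps.

```r
# Values from the question; NA marks the three missing hours.
v <- c(5704.27388916016, 4470.30868326823, 4552.57241617839,
       4570.22250032826, NA, NA, NA, 12697.3772408622,
       5538.71119009654, 81.950606473287, 8550.65816895301,
       2925.76573206584)

# Run-length encoding of is.na(v) gives the length of each NA run.
runs <- rle(is.na(v))
gap_lengths <- runs$lengths[runs$values]

# Linear interpolation over positions with base R's approx():
# fit on the observed points, evaluate at the missing positions.
idx <- seq_along(v)
ok <- !is.na(v)
filled <- v
filled[!ok] <- approx(idx[ok], v[ok], xout = idx[!ok])$y

print(gap_lengths)
print(round(filled[!ok], 5))
```

Each filled value moves one quarter of the way from 4570.22 to 12697.38, so a long gap becomes a straight ramp between its two neighbours. Whether that is reasonable depends on how the series behaves over gaps of that length.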
Data used:
df <- structure(list(V1 = structure(c(1430082000, 1430085600, 1430089200, 1430092800, 1430096400, 1430100000, 1430103600, 1430107200, 1430110800, 1430114400, 1430118000, 1430121600), class = c("POSIXct", "POSIXt"), tzone = ""), V2 = c(5704.27388916016, 4470.30868326823, 4552.57241617839, 4570.22250032826, NA, NA, NA, 12697.3772408622, 5538.71119009654, 81.950606473287, 8550.65816895301, 2925.76573206584)), .Names = c("V1", "V2"), row.names = c(NA, -12L), class = "data.frame")
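When a run of NAs is long, filling all of it may not be desirable. As a sketch, assuming the maxgap argument of na.approx behaves as described in the zoo documentation (runs of NAs longer than maxgap are left unchanged), short gaps can be filled while long ones are kept as NA. The vector x is made-up illustration data, not the data above.

```r
library(zoo)

# Hypothetical series: one single NA and one run of three NAs.
x <- c(1, NA, 3, NA, NA, NA, 7)

# Fill only runs of at most 2 consecutive NAs; na.rm = FALSE keeps
# the result the same length as the input.
res <- na.approx(x, maxgap = 2, na.rm = FALSE)
print(res)
```

The same idea should apply to na.locf, which also documents a maxgap argument. The threshold itself is a judgment call that depends on the data.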
Addendum:
The imputeTS and forecast packages also provide time-series functions for handling NA values, including some more advanced ones.
For example:
library("imputeTS")
# Moving Average Imputation
na_ma(df$V2)
# Imputation via Kalman Smoothing on structural time series models
na_kalman(df$V2)
# Plain interpolation with a choice of method (linear, spline, stine)
na_interpolation(df$V2)
Or:
library("forecast")
# Interpolation, using a seasonal decomposition when the series is seasonal
na.interp(df$V2)