How to create random gaps of different lengths in a time series?
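For my master's thesis I have to test different gap-filling methods on an existing dataset. To do so, I need to add artificial gaps of different lengths (1h, 5h, ...) so that I can fill them with the different methods. Is there a simple function that does this?
Here is an example of the data frame: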
df <- structure(list(DateTime = structure(c(1420074000, 1420077600,
1420081200, 1420084800, 1420088400, 1420092000, 1420095600, 1420099200,
1420102800, 1420106400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`Dd 1-1` = c(0.0186269166666667, 0.0242605625, 0.00373020138888889,
0.000966965277777778, 0.0119253611111111, 0.0495888958333333,
0.02014125, 0.0306862638888889, 0.0324395694444444, 0.0191942152777778
), `Dd 1-3` = c(0.0242500833333333, 0.0349086388888889, 0,
0.00135595138888889, 0.0221090138888889, 0.0600941527777778,
0.0462282986111111, 0.0171887638888889, 0.0481975347222222,
0.0226582152777778), `Dd 1-5` = c(0.0212732152777778, 0.0284445347222222,
0.00276098611111111, 0.0142581875, 0.0276248958333333, 0.0328644027777778,
0.0495009166666667, 0.0173377777777778, 0.0384788194444444,
0.017663875), luecken = c(0.0186269166666667, 0.0242605625,
0.00373020138888889, 0.000966965277777778, 0.0119253611111111,
0.0495888958333333, 0.02014125, 0.0306862638888889, 0.0324395694444444,
0.0191942152777778)), row.names = c(NA, 10L), class = c("tbl_df",
"tbl", "data.frame"))
If I understood your question correctly, one possible solution is:
set.seed(4) # make it reproducible
del <- sort(sample(1:nrow(df), 4, replace=FALSE)) # draw 4 random row indices and sort them
del2 <- del[diff(del) != 1] # drop indices that differ by 1 from their neighbour (i.e. "connected" ones)
df[del2, c(2:5)] <- NA # set columns 2 to 5 to NA for the indices calculated above
   DateTime            `Dd 1-1` `Dd 1-3` `Dd 1-5`  luecken
   <dttm>                 <dbl>    <dbl>    <dbl>    <dbl>
 1 2015-01-01 01:00:00   0.0186   0.0243   0.0213   0.0186
 2 2015-01-01 02:00:00   0.0243   0.0349   0.0284   0.0243
 3 2015-01-01 03:00:00       NA       NA       NA       NA
 4 2015-01-01 04:00:00 0.000967  0.00136   0.0143 0.000967
 5 2015-01-01 05:00:00   0.0119   0.0221   0.0276   0.0119
 6 2015-01-01 06:00:00   0.0496   0.0601   0.0329   0.0496
 7 2015-01-01 07:00:00   0.0201   0.0462   0.0495   0.0201
 8 2015-01-01 08:00:00   0.0307   0.0172   0.0173   0.0307
 9 2015-01-01 09:00:00       NA       NA       NA       NA
10 2015-01-01 10:00:00   0.0192   0.0227   0.0177   0.0192
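One caveat: the step that removes connected gaps is not entirely correct. If the sampled indices happen to be 1 to 4, for example, it throws away 2, 3 and 4 instead of treating them as one run; and because diff(del) is one element shorter than del, the logical index is recycled, so the first index can be dropped as well. On a large dataset it should still be good enough, as long as you do not plan to remove many values relative to the size of the whole dataset.
Now for how to create larger gaps (I will use 3h, since your example data only has 10 rows):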
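If the edge case matters, the filtering step can be written so that the logical index has the same length as the index vector. The following is only a minimal sketch, not part of the original answer; `drop_adjacent` is a hypothetical helper name, and it assumes that keeping the first index of each run of consecutive values is acceptable.

```r
# Hypothetical helper (an assumption, not from the answer above):
# keep only the first index of each run of consecutive values.
drop_adjacent <- function(idx) {
  idx <- sort(unique(idx))
  # Prepend Inf so the first index is always kept and the logical
  # vector has the same length as idx (no recycling).
  idx[c(Inf, diff(idx)) != 1]
}

isolated_a <- drop_adjacent(c(1, 2, 3, 4))     # the whole run collapses to its first index
isolated_b <- drop_adjacent(c(2, 3, 7, 9, 10)) # 3 and 10 are dropped as "connected"
```

The result can be used in place of del2 when setting rows to NA.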
set.seed(4)
del <- sort(sample(1:nrow(df), 3, replace=FALSE))
del2 <- del[diff(del) > 3] # require a difference larger than the maximum gap size wanted
del3 <- c(del2, del2 + 1, del2 + 2) # add +1 and +2 to get the indices connected to the ones you have
del4 <- del3[del3 <= nrow(df)] # make sure nothing is out of bounds (the max index is 10, even if a gap starts at row 10)
df[del4, c(2:5)] <- NA
   DateTime            `Dd 1-1` `Dd 1-3` `Dd 1-5`  luecken
   <dttm>                 <dbl>    <dbl>    <dbl>    <dbl>
 1 2015-01-01 01:00:00   0.0186   0.0243   0.0213   0.0186
 2 2015-01-01 02:00:00   0.0243   0.0349   0.0284   0.0243
 3 2015-01-01 03:00:00       NA       NA       NA       NA
 4 2015-01-01 04:00:00       NA       NA       NA       NA
 5 2015-01-01 05:00:00       NA       NA       NA       NA
 6 2015-01-01 06:00:00   0.0496   0.0601   0.0329   0.0496
 7 2015-01-01 07:00:00   0.0201   0.0462   0.0495   0.0201
 8 2015-01-01 08:00:00   0.0307   0.0172   0.0173   0.0307
 9 2015-01-01 09:00:00       NA       NA       NA       NA
10 2015-01-01 10:00:00       NA       NA       NA       NA
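Since the question asks for a simple function, the steps above could be wrapped up roughly as follows. This is a sketch under stated assumptions, not the method from the answer: `add_gaps` and its arguments are hypothetical names, and instead of filtering sampled indices it divides the rows into blocks of gap_len + 1 rows and places each gap at the start of a randomly chosen block. That guarantees full-length, non-touching gaps, at the cost of restricting gap starts to a fixed grid.

```r
# Hypothetical helper (an assumption, not from the answer above):
# set `n_gaps` runs of `gap_len` consecutive rows to NA in the columns `cols`.
add_gaps <- function(data, n_gaps, gap_len, cols = 2:ncol(data)) {
  n <- nrow(data)
  # Non-overlapping blocks of gap_len + 1 rows: the gap fills the first
  # gap_len rows, the last row stays as a separator between gaps.
  n_blocks <- n %/% (gap_len + 1)
  if (n_gaps > n_blocks) stop("not enough rows for that many gaps")
  blocks <- sort(sample.int(n_blocks, n_gaps))
  starts <- (blocks - 1) * (gap_len + 1) + 1
  # All row indices covered by the gaps: start, start + 1, ..., start + gap_len - 1
  idx <- as.vector(outer(starts, seq_len(gap_len) - 1, `+`))
  data[idx, cols] <- NA
  data
}

# Toy data (made up for illustration): 24 hourly rows, two value columns
set.seed(1)
toy <- data.frame(t = 1:24, x = rnorm(24), y = rnorm(24))
gappy <- add_gaps(toy, n_gaps = 2, gap_len = 5)

# One gappy copy per gap length, e.g. 1h and 5h gaps
gappy_by_len <- lapply(c(1, 5), function(len) add_gaps(toy, n_gaps = 2, gap_len = len))
```

Keeping the untouched original (toy here) next to each gappy copy makes it possible to compare the filled values with the true ones afterwards.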