10-min moving average to 1-hour moving average in R
I have a set of weather data that is a 10-minute moving average, reported at 1-minute intervals. I want to convert it to 1-hour averages.
Date Direction Speed
1 2017-07-06 00:01:00 93 7.3
2 2017-07-06 00:02:00 92 7.4
3 2017-07-06 00:03:00 92 7.3
4 2017-07-06 00:04:00 91 7.4
5 2017-07-06 00:05:00 91 7.3
6 2017-07-06 00:06:00 91 7.3
7 2017-07-06 00:07:00 91 7.2
8 2017-07-06 00:08:00 90 7.1
9 2017-07-06 00:09:00 90 6.9
10 2017-07-06 00:10:00 91 6.7
...
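(thousands of rows of data at 1-minute intervals)
* The Direction and Speed values above are 10-minute moving averages
The usual built-in moving-average functions take every neighbouring value, for example: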
rollmean(timeLine$Speed, 60, fill=FALSE, align = "right")
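produces a rolling mean for each value over n, n-1, n-2, n-3, ..., n-59.
However, since my raw data are already 10-minute averages, I only need the values at n, n-10, n-20, n-30, n-40 and n-50 to convert them to a 1-hour average.
For example, if I want the hourly value for 2001-07-06 10:00:00, I only need to average the following:
- 2001-07-06 10:00:00
- 2001-07-06 09:50:00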
- 2001-07-06 09:40:00
- 2001-07-06 09:30:00
- 2001-07-06 09:20:00
- 2001-07-06 09:10:00
Is there a way to calculate this smoothly in R?
Thanks in advance for your help!
Update 1: here is dput(head(timeLine, 10))
structure(
list(
Date = structure(c(1499270460, 1499270520, 1499270580, 1499270640, 1499270700, 1499270760, 1499270820, 1499270880, 1499270940, 1499271000),
class = c("POSIXct", "POSIXt"), tzone = "Asia/Hong_Kong"),
Direction = c(93L, 92L, 92L, 91L, 91L, 91L, 91L, 90L, 90L, 91L),
Speed = c(7.3, 7.4, 7.3, 7.4, 7.3, 7.3, 7.2, 7.1, 6.9, 6.7)),
.Names = c("Date", "Direction", "Speed"),
row.names = c(NA, 10L),
class = "data.frame")
Well, there is almost certainly a more elegant way, but I think this works. I use the lubridate package to convert to date-time format easily:
library(tidyverse)
library(lubridate)
df = read.csv(text="
Date,Time,Direction,Speed
2001-07-04,09:01:00,310,4.0
2001-07-04,09:02:00,310,3.9
2001-07-04,09:03:00,310,3.9
2001-07-04,09:04:00,310,3.9
2001-07-04,09:05:00,300,3.9
2001-07-04,09:06:00,300,4.0
2001-07-04,09:07:00,300,3.9
2001-07-04,09:08:00,300,4.0
2001-07-04,09:09:00,300,4.0
2001-07-04,09:10:00,300,4.0
2001-07-04,09:11:00,290,4.0
2001-07-04,09:12:00,290,4.0
2001-07-04,09:13:00,290,4.0
2001-07-04,09:14:00,290,4.0
2001-07-04,09:15:00,290,4.0", sep=",", header = TRUE, row.names = NULL)
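lagged_avg = function(col) {
  # Lags (in rows, i.e. minutes) of the six 10-minute means that make up one hour
  lag_positions = c(0, 10, 20, 30, 40, 50)
  total = 0
  for (n in lag_positions) {
    total = total + lag(col, n)   # dplyr::lag shifts by n rows and pads with NA
  }
  return(total / 6)
}
# Note: rows with fewer than 50 earlier rows (all of the 15-row sample above) come out as NA
df = df %>%
  mutate(datetime = ymd_hms(paste0(Date, " ", Time))) %>%
  mutate(lag = lagged_avg(Speed)) %>%
  select(-Date, -Time)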
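As a sanity check on the idea itself, here is a minimal base-R sketch. It assumes the 10-minute values are trailing means of ten 1-minute readings, which the question does not state explicitly. The raw readings are invented, and `raw`, `ma10`, `ma60` and `hourly_from_ma10` are names made up for this illustration. Under that assumption, averaging the six lagged 10-minute means reproduces the trailing 60-minute mean of the raw readings:

```r
# Synthetic 1-minute raw readings (invented values, for illustration only)
set.seed(1)
raw <- runif(180, min = 5, max = 9)

# Trailing 10-minute moving average; NA until 10 readings are available
ma10 <- as.numeric(stats::filter(raw, rep(1 / 10, 10), sides = 1))

# Average of the 10-minute means at lags 0, 10, ..., 50
lags <- seq(0, 50, by = 10)
hourly_from_ma10 <- sapply(seq_along(ma10), function(i) {
  j <- i - lags
  if (any(j < 1)) NA_real_ else mean(ma10[j])
})

# Direct trailing 60-minute mean of the raw readings
ma60 <- as.numeric(stats::filter(raw, rep(1 / 60, 60), sides = 1))

# Compare the two wherever both are defined
ok <- !is.na(hourly_from_ma10) & !is.na(ma60)
all.equal(hourly_from_ma10[ok], ma60[ok])
```

The six 10-minute windows do not overlap and together cover exactly 60 readings, which is why the two results agree. If the source averages were centred or weighted instead, this equivalence would not necessarily hold.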
I would look at the tibbletime package; specifically, the collapse_by() function is useful. The following should work (it would be easier to test with more data):
library(tidyverse)
library(lubridate)
library(tibbletime)
tbl_time(timeLine, index = Date) %>%
filter(minute(Date) %in% seq(0, 50, 10)) %>%
collapse_by("hour", clean = TRUE) %>%
group_by(Date) %>%
summarise_all(mean)
Note: depending on how you think about the hours, you may want to change the collapse_by line to collapse_by("hour", clean = TRUE, side = "start"); by default it uses side = "end".
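One solution is to first filter the data down to the 0th, 10th, 20th, 30th, 40th and 50th minute of each hour. To do that, divide the minute of the date/time by 10 and keep the rows where the remainder equals 0. Then apply zoo::rollmean over every 6 observations. This way, each hourly mean is computed from the readings at minutes 10, 20, 30, 40, 50 and 0. Finally, filter for minute == 0 (on the hour).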
library(zoo)
library(lubridate)
library(tidyverse)
timeLine_mod %>%
  filter(minute(Date) %% 10 == 0) %>%
  # fill = NA pads incomplete windows with NA (fill = FALSE would pad with 0)
  mutate(meanSpeed = rollmean(Speed, 6, fill = NA, align = "right")) %>%
  filter(minute(Date) == 0)
# Date Direction Speed meanSpeed
# 1 2017-07-06 01:00:00 91 6.7 6.7
# 2 2017-07-06 02:00:00 91 6.7 6.7
# 3 2017-07-06 03:00:00 91 6.7 6.7
Data: the OP provided only 10 minutes of data, which is not enough to compute an hourly average, so I extended the data to cover 3 hours:
timeLine <- structure(list(Date = structure(c(1499270460, 1499270520, 1499270580,
1499270640, 1499270700, 1499270760, 1499270820, 1499270880, 1499270940, 1499271000),
class = c("POSIXct", "POSIXt"), tzone = "Asia/Hong_Kong"),
Direction = c(93L, 92L, 92L, 91L, 91L, 91L, 91L, 90L, 90L, 91L),
Speed = c(7.3, 7.4, 7.3, 7.4, 7.3, 7.3, 7.2, 7.1, 6.9, 6.7)),
.Names = c("Date", "Direction", "Speed"), row.names = c(NA, 10L),
class = "data.frame")
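# Extend the data to cover 3 hours at 1-minute intervals
timeLine_mod <- timeLine %>% complete(Date = seq(min(Date),
    min(Date)+60*60*3-60,by="1 min"))
# Repeat the 10 original Direction and Speed values to fill every row
# (rep() is used because recent tibble versions do not recycle a length-10 vector)
timeLine_mod$Direction <- rep(timeLine$Direction, length.out = nrow(timeLine_mod))
timeLine_mod$Speed <- rep(timeLine$Speed, length.out = nrow(timeLine_mod))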
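Because the extended data repeat the same ten values, every reading at minute 0, 10, ..., 50 is 6.7, so the result would be 6.7 even if the window were wrong. Here is a rough base-R sketch of the same filter-then-roll steps on a series that actually varies. The `demo` data frame and its Speed values are invented for illustration, and stats::filter stands in for zoo::rollmean so that no packages are needed:

```r
# Hypothetical 3-hour series with a varying Speed (invented values)
Date <- seq(as.POSIXct("2017-07-06 00:01:00", tz = "Asia/Hong_Kong"),
            by = "1 min", length.out = 180)
demo <- data.frame(Date = Date, Speed = seq_along(Date) / 10)

# Keep only the readings at minutes 0, 10, ..., 50
ten <- demo[as.POSIXlt(demo$Date)$min %% 10 == 0, ]

# Trailing mean over 6 consecutive 10-minute readings (NA for the first five)
ten$meanSpeed <- as.numeric(stats::filter(ten$Speed, rep(1 / 6, 6), sides = 1))

# Keep the on-the-hour rows only
hourly <- ten[as.POSIXlt(ten$Date)$min == 0, ]
hourly
```

Here the 01:00 row averages the readings from 00:10 through 01:00, so each hourly value is stamped at the end of its hour, as in the question's example.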
rollapplyr (the r at the end means right-aligned) in zoo allows offsets to be specified with width = list(offset_vector), like this:
transform(timeLine, avg = rollapplyr(Speed, list(seq(-50, 0, 10)), mean, fill = NA))
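To see what the offset list does, here is a small sketch on an invented vector. `speed` is a made-up stand-in for timeLine$Speed with 180 one-minute rows, and it assumes the zoo package is installed:

```r
library(zoo)

# Invented stand-in for 180 one-minute Speed values
speed <- seq_len(180) / 10

# For each row, average the values at offsets -50, -40, ..., 0 (in rows)
avg <- rollapplyr(speed, list(seq(-50, 0, 10)), mean, fill = NA)

# Row 60 averages rows 10, 20, 30, 40, 50 and 60
avg[60]
mean(speed[seq(10, 60, by = 10)])
```

This gives a value for every minute from row 51 onward. To keep one value per hour, subset the result to the rows whose timestamp falls on the hour.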