Pulling lagged data but only for a particular season in R
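I have a dataset with two variables. One is numeric, and the other is a character string identifying the season and year the numeric value comes from. Here is what the head of the data looks like: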
SeasonYear mean
<chr> <dbl>
1 winter2000 0.957
2 spring2000 0.943
3 summer2000 1.03
4 fall2000 0.981
5 winter2001 1.06
6 spring2001 1.05
7 summer2001 1.02
8 fall2001 1.03
9 winter2002 1.02
10 spring2002 1.05
Now I want to pull a lag of this data, but only from the previous spring, so that my data looks like this:
SeasonYear mean lag
<chr> <dbl> <dbl>
1 winter2000 0.957 NA
2 spring2000 0.943 NA
3 summer2000 1.03 0.943
4 fall2000 0.981 0.943
5 winter2001 1.06 0.943
6 spring2001 1.05 0.943
7 summer2001 1.02 1.05
8 fall2001 1.03 1.05
9 winter2002 1.02 1.05
10 spring2002 1.05 1.05
I would also like to be able to go back 2 springs, so that my data looks like this:
SeasonYear mean lag
<chr> <dbl> <dbl>
1 winter2000 0.957 NA
2 spring2000 0.943 NA
3 summer2000 1.03 NA
4 fall2000 0.981 NA
5 winter2001 1.06 NA
6 spring2001 1.05 NA
7 summer2001 1.02 0.943
8 fall2001 1.03 0.943
9 winter2002 1.02 0.943
10 spring2002 1.05 0.943
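I know I can use the lag() function to get previous values in a data frame, but I am looking for a way to write a function that pulls the specific kind of lag described above.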
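One option to achieve your desired result might look like this:
- Split your SeasonYear into season and year
- Add a column holding the spring value for each year
- Get the n-th lag, taking into account that summer and fall need the (n-1)-th lag, since they come after their own year's spring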
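library(tidyr)
library(dplyr)
lag_spring <- function(x, y, n = 1) {
  data.frame(x = x, season_year = y) %>%
    # Split e.g. "winter2000" into season = "winter" and year = "2000"
    tidyr::extract(season_year, into = c("season", "year"), regex = "^(.+?)(\\d{4})$") %>%
    # Attach each year's spring value to every row of that year
    group_by(year) %>%
    mutate(springmean = x[season == "spring"]) %>%
    ungroup() %>%
    # Within one season, consecutive rows are a year apart, so lag() steps back by years.
    # Summer and fall follow their own year's spring, so they need one step fewer.
    group_by(season) %>%
    mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}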
dd %>%
mutate(lag = lag_spring(mean, SeasonYear))
#> SeasonYear mean lag
#> 1 winter2000 0.957 NA
#> 2 spring2000 0.943 NA
#> 3 summer2000 1.030 0.943
#> 4 fall2000 0.981 0.943
#> 5 winter2001 1.060 0.943
#> 6 spring2001 1.050 0.943
#> 7 summer2001 1.020 1.050
#> 8 fall2001 1.030 1.050
#> 9 winter2002 1.020 1.050
#> 10 spring2002 1.050 1.050
dd %>%
mutate(lag = lag_spring(mean, SeasonYear, n = 2))
#> SeasonYear mean lag
#> 1 winter2000 0.957 NA
#> 2 spring2000 0.943 NA
#> 3 summer2000 1.030 NA
#> 4 fall2000 0.981 NA
#> 5 winter2001 1.060 NA
#> 6 spring2001 1.050 NA
#> 7 summer2001 1.020 0.943
#> 8 fall2001 1.030 0.943
#> 9 winter2002 1.020 0.943
#> 10 spring2002 1.050 0.943
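To check the split step on its own, the sketch below applies the same regular expression with base R's regexec() and regmatches(). This is a stand-in for tidyr::extract(), used only so the snippet runs without extra packages. The names season_year, pattern, and split_df are illustrative. Inside an R string literal, the regex token \d has to be written as \\d.

```r
# Illustrative input: a few SeasonYear-style labels
season_year <- c("winter2000", "spring2000", "summer2001")

# Lazy group for the season name, then exactly four digits for the year
pattern <- "^(.+?)(\\d{4})$"

# regexec() records the full match plus each capture group;
# regmatches() turns those positions into character vectors
m <- regmatches(season_year, regexec(pattern, season_year, perl = TRUE))

# One row per label: column 1 = full match, 2 = season, 3 = year
parts <- do.call(rbind, m)
split_df <- data.frame(
  season = parts[, 2],
  year = parts[, 3],
  stringsAsFactors = FALSE
)
print(split_df)
```

The year column comes back as character here, which is enough for grouping by year.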
Data
dd <- structure(list(SeasonYear = c(
"winter2000", "spring2000", "summer2000",
"fall2000", "winter2001", "spring2001", "summer2001", "fall2001",
"winter2002", "spring2002"
), mean = c(
0.957, 0.943, 1.03, 0.981,
1.06, 1.05, 1.02, 1.03, 1.02, 1.05
)), class = "data.frame", row.names = c(
"1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"
))
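As an alternative sketch, assuming the rows are already sorted chronologically, the same lag can be computed in base R by counting how many spring rows precede each row and indexing into the vector of spring values. lag_spring_idx is a hypothetical name, not part of the answer above, and the approach does not rely on every year having a spring row.

```r
# Hypothetical helper: value of the n-th most recent spring strictly before each row.
# Assumes rows are in chronological order and labels start with the season name.
lag_spring_idx <- function(x, season_year, n = 1) {
  is_spring <- grepl("^spring", season_year)
  spring_vals <- x[is_spring]
  # Number of springs seen strictly before each row (a spring row excludes itself)
  seen_before <- cumsum(is_spring) - is_spring
  # Position in spring_vals of the spring n steps back; below 1 means not available
  idx <- seen_before - n + 1
  idx[idx < 1] <- NA
  spring_vals[idx]
}

# Same values as the example data set
season_year <- c(
  "winter2000", "spring2000", "summer2000", "fall2000", "winter2001",
  "spring2001", "summer2001", "fall2001", "winter2002", "spring2002"
)
means <- c(0.957, 0.943, 1.03, 0.981, 1.06, 1.05, 1.02, 1.03, 1.02, 1.05)

lag1 <- lag_spring_idx(means, season_year)
lag2 <- lag_spring_idx(means, season_year, n = 2)
print(data.frame(SeasonYear = season_year, mean = means, lag1 = lag1, lag2 = lag2))
```

On this data, lag1 and lag2 reproduce the two lag columns requested in the question. If the rows were not in time order, they would need to be sorted first.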