Block Bootstrapping using Tidymodels
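I have a monthly (January to December) weather and crop yield dataset, collected over the years 2002 to 2019. My goal is to obtain bootstrapped slope coefficients for the effect of temperature on the yield gap in each month. In the bootstrapping, I want to block on year: each bootstrap should randomly draw the data of whole years rather than pick rows from a mix of years.
I have read a few blog posts and tried different approaches, but I am not confident in them. I tried to dissect the bootstrapped splits to check whether I was doing it correctly, and I was not.
Here is the starting code: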
# Load libraries
library(readxl)
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(reprex)
# data
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (3): ID, Location, Month
#> dbl (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date (1): Date
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
ww_wt %>%
  select(Year, Month, gap, temp) %>%
  head()
#> # A tibble: 6 x 4
#>    Year Month       gap  temp
#>   <dbl> <chr>     <dbl> <dbl>
#> 1  2002 September 0.282 13.6
#> 2  2002 October   0.282 13.3
#> 3  2002 November  0.282  7.07
#> 4  2002 December  0.282  3.44
#> 5  2002 January   0.282  5.61
#> 6  2002 February  0.282  6.93
# Bootstrapping
set.seed(123)
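boots <- ww_wt %>%
  ungroup() %>%
  select(Year, Month, gap, temp) %>%
  # One nested data frame per month
  nest(data = -c(Month)) %>%
  # Ordinary row-wise bootstraps within each month (years are not kept together)
  mutate(boots = map(data, ~ bootstraps(.x, times = 100, apparent = FALSE))) %>%
  unnest(boots) %>%
  # Fit gap ~ temp on the analysis set of every split and tidy the coefficients
  mutate(
    model = map(splits, ~ lm(gap ~ temp, data = analysis(.x))),
    coefs = map(model, tidy)
  )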
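Created on 2022-01-04 by the reprex package (v2.0.1)
I am nesting by Month because I want the slopes for each month separately. Also, each year's data have a different sample size n, because the number of locations differs from year to year.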
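To make the requirement concrete, here is a minimal sketch of what blocking on Year means: draw years with replacement, then keep every row that belongs to each drawn year. It uses base R only and a small made-up data frame; `toy` and `block_bootstrap_once()` are hypothetical names for illustration, not part of the real data or of rsample.

```r
# Toy data standing in for one month of ww_wt (all values are made up).
set.seed(1)
toy <- data.frame(
  Year = rep(2002:2005, times = c(3, 2, 4, 3)),  # unequal n per year
  temp = round(rnorm(12, mean = 12, sd = 2), 1),
  gap  = round(runif(12, min = 0.1, max = 0.5), 3)
)

# Hypothetical helper: one block-bootstrap sample that resamples whole years.
block_bootstrap_once <- function(data, block = "Year") {
  blocks <- unique(data[[block]])
  # Draw as many blocks as there are distinct blocks, with replacement.
  drawn <- blocks[sample.int(length(blocks), replace = TRUE)]
  # Keep all rows of each drawn block; a block drawn twice appears twice.
  pieces <- lapply(seq_along(drawn), function(i) {
    piece <- data[data[[block]] == drawn[i], , drop = FALSE]
    piece$draw <- i  # index of the draw this copy came from
    piece
  })
  do.call(rbind, pieces)
}

one_sample <- block_bootstrap_once(toy)

# Each draw (row of the table) contains rows from exactly one year.
table(one_sample$draw, one_sample$Year)
```

Because the years have unequal n, the size of such a sample varies from one bootstrap to the next, unlike an ordinary row-wise bootstrap.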
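We don't currently have support for grouped or blocked bootstrapping; we are tracking interest in more group-based methods here.
If you want to create a resampling scheme that holds out whole groups of data, you could check out group_vfold_cv() (maybe together with nested_cv()?) to see if it fits your needs in the meantime. It results in a resampling scheme that looks like this: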
library(tidyverse)
library(tidymodels)
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): ID, Location, Month
#> dbl (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date (1): Date
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123)
folds <-
  ww_wt %>%
  filter(Month == "September") %>%
  select(Year, Month, gap, temp) %>%
  group_vfold_cv(
    group = Year,
    v = 7
  )
folds
#> # Group 7-fold cross-validation
#> # A tibble: 7 × 2
#>   splits           id
#>   <list>           <chr>
#> 1 <split [135/26]> Resample1
#> 2 <split [137/24]> Resample2
#> 3 <split [133/28]> Resample3
#> 4 <split [132/29]> Resample4
#> 5 <split [142/19]> Resample5
#> 6 <split [144/17]> Resample6
#> 7 <split [143/18]> Resample7
tidy(folds) %>%
  ggplot(aes(x = Resample, y = Row, fill = Data)) +
  geom_tile() +
  scale_fill_brewer()
Created on 2022-01-07 by the reprex package (v2.0.1)
If you like, you can increase v, and you can nest by Month first and do this for each month.
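The property that matters in a grouped scheme like this is that no Year is ever split between the analysis and assessment sets. A rough base R sketch of the same idea on made-up data follows; it is not how rsample implements group_vfold_cv(), and `toy`, `fold_id`, and the choice of `v = 3` are assumptions for illustration.

```r
set.seed(123)
# Made-up data with an unequal number of rows per year.
toy <- data.frame(
  Year = rep(2002:2008, times = c(3, 2, 4, 3, 2, 5, 3)),
  temp = round(rnorm(22, mean = 12, sd = 2), 1)
)

v <- 3
years <- unique(toy$Year)
# Randomly assign each year (not each row) to one of v folds.
fold_id <- sample(rep_len(seq_len(v), length(years)))

for (k in seq_len(v)) {
  held_out   <- toy$Year %in% years[fold_id == k]
  analysis   <- toy[!held_out, ]
  assessment <- toy[held_out, ]
  # A year must never appear on both sides of a split.
  stopifnot(length(intersect(analysis$Year, assessment$Year)) == 0)
  cat("Fold", k, ": assessment years =",
      paste(sort(unique(assessment$Year)), collapse = ", "), "\n")
}
```

Later rsample releases also added group_bootstraps(), which resamples whole groups with replacement; if your installed version provides it, that is probably the more direct tool for this question.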
Thanks, Julia, for these tips. I think I have solved the problem by adding a few extra lines of code.
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(viridis)
#> Loading required package: viridisLite
#>
#> Attaching package: 'viridis'
#> The following object is masked from 'package:scales':
#>
#> viridis_pal
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (3): ID, Location, Month
#> dbl (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date (1): Date
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123)
# Block bootstrapping
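boots <- ww_wt %>%
  ungroup() %>%
  select(Year, Month, gap, temp) %>%
  # Go wide so that each row holds all twelve monthly temperatures
  pivot_wider(names_from = Month, values_from = temp, values_fn = mean) %>%
  bootstraps(times = 10, apparent = FALSE) %>%
  # Replace each split object with its analysis (resampled) data
  mutate(splits = map(splits, analysis)) %>%
  unnest(splits) %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  # Back to long format: one row per resampled row and month
  pivot_longer(names_to = "Month", values_to = "temp", cols = September:August)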
# Bootstraps
boots %>%
  group_by(id) %>%
  ggplot(aes(x = id, y = row, fill = Year)) +
  geom_tile() +
  scale_fill_viridis(option = "B", direction = 1) +
  labs(x = NULL) +
  facet_wrap(~Month) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
The plot above shows that, in each bootstrap, the data from each Year are being sampled randomly with replacement.
Created on 2022-01-08 by the reprex package (v2.0.1)
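The original goal was a bootstrapped slope for temp in each month. One possible way to finish the analysis, sketched on made-up data with base R, is to resample whole years, refit lm(gap ~ temp) within each month, and summarise the slopes. The data frame `toy`, the helper `slope_by_month()`, the 200 resamples, and the 95% percentile interval are all assumptions for illustration, not part of the solution above.

```r
set.seed(123)
months <- c("September", "October", "November")
years  <- 2002:2009

# Made-up data: 3 locations per year, one row per location, year, and month.
toy <- expand.grid(Location = 1:3, Year = years, Month = months,
                   stringsAsFactors = FALSE)
toy$temp <- rnorm(nrow(toy), mean = 10, sd = 3)
toy$gap  <- 0.3 + 0.01 * toy$temp + rnorm(nrow(toy), sd = 0.05)

# Hypothetical helper: slope of gap ~ temp, fitted separately for each month.
slope_by_month <- function(data) {
  sapply(split(data, data$Month), function(d) {
    coef(lm(gap ~ temp, data = d))[["temp"]]
  })
}

# Block bootstrap: draw years with replacement, keep all rows of each drawn year.
n_boot <- 200
slopes <- t(replicate(n_boot, {
  drawn <- years[sample.int(length(years), replace = TRUE)]
  resampled <- do.call(rbind, lapply(drawn, function(y) toy[toy$Year == y, ]))
  slope_by_month(resampled)
}))

# Percentile interval and median of the slope, one column per month.
apply(slopes, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

Note that in the code above the unit being resampled is a row of the wide table, that is, one Year/gap combination with its twelve monthly temperatures, whereas this sketch resamples entire years. Which unit is appropriate depends on where you expect the dependence in the data to be.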