Block Bootstrapping using Tidymodels

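I have a monthly (January to December) weather and crop yield data set, collected over the years 2002 to 2019. My goal is to get bootstrapped slope coefficients for the effect of each month's temperature on the yield gap. When bootstrapping, I want to block on year, so that each bootstrap resample randomly draws whole years of data rather than picking rows from a mix of years.

I have read some blog posts and tried different approaches, but I am not confident in any of them. I tried to dissect the bootstrapped splits to check whether I was doing it correctly, and it turned out I was not.

Here is the starting code: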

# Load libraries
library(readxl)
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(reprex)

# data 
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

ww_wt %>% 
  select(Year, Month, gap, temp) %>%
  head()
#> # A tibble: 6 x 4
#>    Year Month       gap  temp
#>   <dbl> <chr>     <dbl> <dbl>
#> 1  2002 September 0.282 13.6 
#> 2  2002 October   0.282 13.3 
#> 3  2002 November  0.282  7.07
#> 4  2002 December  0.282  3.44
#> 5  2002 January   0.282  5.61
#> 6  2002 February  0.282  6.93

# Bootstrapping
set.seed(123)

boots <- ww_wt %>% 
  ungroup() %>% 
  select(Year, Month, gap, temp) %>%
  nest(data = -c(Month)) %>% 
  mutate(boots = map(data, ~bootstraps(.x, times = 100, apparent = FALSE))) %>%
  unnest(boots) %>% 
  mutate(model = map(splits, ~lm(gap ~ temp, data = analysis(.))),
         coefs = map(model, tidy))

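Created on 2022-01-04 by the reprex package (v2.0.1)

I am nesting by Month because I want a separate slope for each month. Also, each year has a different sample size n, because the number of locations varies from year to year.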
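
One way to dissect the splits is to count how many rows each resample takes from each year. The sketch below does this on a small made-up data set; the `toy` tibble and its values are assumptions for illustration and are not the real `ww_wt` data. Since plain `bootstraps()` draws individual rows, the per-year counts will generally not be whole blocks of a year's rows.

```r
library(dplyr)
library(purrr)
library(rsample)

# Toy stand-in for one month of ww_wt (made-up values): 6 years, 4 locations each
set.seed(123)
toy <- tibble(
  Year = rep(2002:2007, each = 4),
  temp = rnorm(24, mean = 10, sd = 3),
  gap  = rnorm(24, mean = 0.3, sd = 0.05)
)

# Ordinary (row-wise) bootstrap, as in the starting code
boots <- bootstraps(toy, times = 5, apparent = FALSE)

# Rows drawn per year in each resample; a whole-year (block) resample
# would only ever show multiples of 4 here
year_counts <- map(boots$splits, ~ count(analysis(.x), Year))
year_counts[[1]]

# TRUE for a resample only if every sampled year came in whole blocks of 4 rows
whole_years <- map_lgl(year_counts, ~ all(.x$n %% 4 == 0))
whole_years
```

If `whole_years` is mostly `FALSE`, the resamples are mixing partial years, which is the behaviour the question is trying to avoid.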

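We don't currently support grouped or blocked bootstrapping; we are tracking interest in more group-based methods here.

If you want to create a resampling scheme that holds out whole groups of data, you could check out group_vfold_cv() (maybe together with nested_cv()?) to see if it fits your needs in the meantime. It results in a resampling scheme that looks like this: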

library(tidyverse)
library(tidymodels)

ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

set.seed(123)
folds <-
  ww_wt %>%
  filter(Month == "September") %>%
  select(Year, Month, gap, temp) %>%
  group_vfold_cv(
    group = Year,
    v = 7
  )

folds
#> # Group 7-fold cross-validation 
#> # A tibble: 7 × 2
#>   splits           id       
#>   <list>           <chr>    
#> 1 <split [135/26]> Resample1
#> 2 <split [137/24]> Resample2
#> 3 <split [133/28]> Resample3
#> 4 <split [132/29]> Resample4
#> 5 <split [142/19]> Resample5
#> 6 <split [144/17]> Resample6
#> 7 <split [143/18]> Resample7

tidy(folds) %>%
  ggplot(aes(x = Resample, y = Row, fill = Data)) +
  geom_tile() + scale_fill_brewer()

Created on 2022-01-07 by the reprex package (v2.0.1)

You can increase v if you like, and you could first nest by Month and do this for each month.
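
Nesting by Month first might look roughly like the sketch below. It runs on made-up data (the `toy` tibble, its `site` column and the choice of `v = 3` are assumptions for illustration), and it ends with a check that no year lands in both the analysis and the assessment set of any split.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

# Toy stand-in for ww_wt (made-up values): 6 years x 2 months x 4 locations
set.seed(123)
toy <- expand_grid(
  Year  = 2002:2007,
  Month = c("September", "October"),
  site  = 1:4
) %>%
  mutate(
    temp = rnorm(n(), mean = 10, sd = 3),
    gap  = rnorm(n(), mean = 0.3, sd = 0.05)
  )

# One grouped cross-validation scheme per month, holding out whole years
folds_by_month <- toy %>%
  nest(data = -Month) %>%
  mutate(folds = map(data, ~ group_vfold_cv(.x, group = Year, v = 3)))

# For each month: does any split share a year between analysis and assessment?
year_overlap <- map_lgl(folds_by_month$folds, function(f) {
  any(map_lgl(
    f$splits,
    ~ length(intersect(analysis(.x)$Year, assessment(.x)$Year)) > 0
  ))
})
year_overlap
```

Note that this is cross-validation rather than bootstrapping: each year is held out once and no year is drawn more than once into an analysis set.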

Thanks, Julia, for these hints. I think I have solved the problem by adding a few extra lines of code.

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(viridis)
#> Loading required package: viridisLite
#> 
#> Attaching package: 'viridis'
#> The following object is masked from 'package:scales':
#> 
#>     viridis_pal

ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

set.seed(123)

# Block bootstrapping
boots <- ww_wt %>% 
  ungroup() %>% 
  select(Year, Month, gap, temp) %>%
  pivot_wider(names_from = Month, values_from = temp, values_fn = mean) %>% 
  bootstraps(times = 10, apparent = FALSE) %>% 
  mutate(splits = map(splits, analysis)) %>%
  unnest(splits) %>% 
  group_by(id) %>% 
  mutate(row = row_number()) %>% 
  pivot_longer(names_to = "Month", values_to = "temp", cols = September:August)

# Bootstraps
boots %>%
  group_by(id) %>% 
  ggplot(aes(x = id, y = row, fill = Year)) +
  geom_tile() +
  scale_fill_viridis(option = "B", direction = 1) +
  labs(x = NULL) +
  facet_wrap(~Month) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))

The plot above clearly shows that, in each bootstrap, data from each Year are being sampled at random with replacement.

Created on 2022-01-08 by the reprex package (v2.0.1)
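
For comparison, a year-level block bootstrap can also be written by hand: draw years with replacement, keep every row of each drawn year, and fit the model on the result. The following is only a sketch under stated assumptions. The `toy` data and its values are made up, and `block_resample()` and `slope_of()` are hypothetical helper names, not functions from tidymodels.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for ww_wt (made-up values); the number of locations varies by year
set.seed(123)
sites <- tibble(Year = 2002:2009, n_sites = rep(c(3L, 5L), times = 4)) %>%
  uncount(n_sites)

toy <- bind_rows(
  mutate(sites, Month = "September"),
  mutate(sites, Month = "October")
) %>%
  mutate(
    temp = rnorm(n(), mean = 10, sd = 3),
    gap  = 0.3 + 0.02 * temp + rnorm(n(), sd = 0.05)
  )

# One block-bootstrap resample: draw years with replacement, then keep every
# row of each drawn year (a year drawn twice contributes its rows twice)
block_resample <- function(data) {
  years <- unique(data$Year)
  drawn <- years[sample.int(length(years), replace = TRUE)]
  bind_rows(lapply(drawn, function(y) filter(data, Year == y)))
}

# Slope of gap on temp for one data set
slope_of <- function(data) coef(lm(gap ~ temp, data = data))[["temp"]]

# 200 block-bootstrap slopes per month
boot_slopes <- toy %>%
  nest(data = -Month) %>%
  mutate(slope = map(data, function(d) {
    map_dbl(1:200, function(i) slope_of(block_resample(d)))
  })) %>%
  select(Month, slope) %>%
  unnest(slope)

# Percentile summary of the bootstrapped slopes
boot_slopes %>%
  group_by(Month) %>%
  summarise(
    mean_slope = mean(slope),
    lower = quantile(slope, 0.025),
    upper = quantile(slope, 0.975)
  )
```

This version draws years independently within each month; to reuse the same drawn years across all months, the resampling step would have to happen before nesting by Month. Later releases of rsample (1.1.0 and newer, as far as I know) also provide `group_bootstraps()`, which may make the hand-written resampler unnecessary.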