How to translate `recipes::step_dummy()` to `dplyr`/`tidyr` code?

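I am trying to figure out how step_dummy() from the recipes package wrangles data. Although there is a reference page for this function, I still cannot work out how to reproduce it with the "regular" tidyverse tools I know. Here is some code based on the recipes and rsample packages. I would like to get the same output using only dplyr/tidyr tools.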

I chose the diamonds dataset from ggplot2 for this demo.

library(rsample)
library(recipes)

data(diamonds, package = "ggplot2") # load the dataset without attaching ggplot2

my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>% 
  prep() %>%
  bake(new_data = NULL) # equivalent to `juice()`. It means to get the training data (`d_training`) after the steps in the recipe were applied to it.

d_training_dummied_using_recipe
#> # A tibble: 5,394 x 6
#>    carat price  cut_1  cut_2     cut_3  cut_4
#>    <dbl> <int>  <dbl>  <dbl>     <dbl>  <dbl>
#>  1  0.5   1678 -0.316 -0.267  6.32e- 1 -0.478
#>  2  0.7   2608 -0.316 -0.267  6.32e- 1 -0.478
#>  3  1.7   9996  0.316 -0.267 -6.32e- 1 -0.478
#>  4  0.73  1824  0.316 -0.267 -6.32e- 1 -0.478
#>  5  0.4    988  0.632  0.535  3.16e- 1  0.120
#>  6  1.04  4240  0.316 -0.267 -6.32e- 1 -0.478
#>  7  0.9   3950  0     -0.535 -4.10e-16  0.717
#>  8  0.4   1116  0     -0.535 -4.10e-16  0.717
#>  9  1.34 10070  0.632  0.535  3.16e- 1  0.120
#> 10  0.6    806  0.316 -0.267 -6.32e- 1 -0.478
#> # ... with 5,384 more rows

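My question is: given d_training, how can we get the same output as d_training_dummied_using_recipe using only dplyr and tidyr (and possibly forcats) functions? I have seen similar posts, but they do not seem to fit this case.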


EDIT


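Clearly, step_dummy() operates only on the cut column, because we specified all_nominal(). Indeed, cut is the only nominal variable in d_training. I assumed the cut_* columns corresponded to the levels of cut, but then I ran: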

levels(d_training$cut)
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"  

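This shows 5 levels, while there are only 4 cut_* columns. So that is one obstacle to understanding what is going on. Also, how are the values in the cut_* columns generated?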


EDIT 2


I came across the most relevant vignette, How are categorical predictors handled in recipes?, which discusses the topic directly.

A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.

The default approach is to create dummy variables using the “reference cell” parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns

Regarding the number of levels versus the number of cut_* columns, the vignette says explicitly:

Note that the column names do not reference a specific level of the [...] variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn’t make sense.

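But in the end there is no example of how to do the same thing with regular tools (outside the recipes context). So my original question remains unanswered.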

This is only half an answer, but it should help you see how the cut_* columns are mapped. See this link for a more detailed look: https://recipes.tidymodels.org/articles/Dummies.html

library(tidyverse)
library(recipes)


diamonds |> 
  select(carat, cut, price) |>
  mutate(original = cut) |>
  (\(d) recipe(formula = price ~ ., data = d))() |>
  step_dummy(cut) |>
  prep()|>
  bake(new_data = NULL, original, starts_with("cut")) |>
  distinct() 
#> # A tibble: 5 x 5
#>   original   cut_1  cut_2     cut_3  cut_4
#>   <ord>      <dbl>  <dbl>     <dbl>  <dbl>
#> 1 Ideal      0.632  0.535  3.16e- 1  0.120
#> 2 Premium    0.316 -0.267 -6.32e- 1 -0.478
#> 3 Good      -0.316 -0.267  6.32e- 1 -0.478
#> 4 Very Good  0     -0.535 -4.10e-16  0.717
#> 5 Fair      -0.632  0.535 -3.16e- 1  0.120

Edit:

Here are more details:

contr.poly(levels(diamonds$cut))
#>              .L         .Q            .C         ^4
#> [1,] -0.6324555  0.5345225 -3.162278e-01  0.1195229
#> [2,] -0.3162278 -0.2672612  6.324555e-01 -0.4780914
#> [3,]  0.0000000 -0.5345225 -4.095972e-16  0.7171372
#> [4,]  0.3162278 -0.2672612 -6.324555e-01 -0.4780914
#> [5,]  0.6324555  0.5345225  3.162278e-01  0.1195229

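The cut_* columns are the mapping from the levels of cut to the rows of the contr.poly matrix: Fair maps to row 1, Good to row 2, and so on up to Ideal at row 5. Note that the values in the cut_* columns are identical to the columns of the contr.poly matrix.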
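Building on that mapping, here is a minimal sketch of a dplyr-only translation: turn the contr.poly matrix into a lookup table with one row per level of cut, then join it onto the data. The names poly_lookup and dummied are made up for illustration, and the sketch assumes cut is an ordered factor with all of its levels present, as it is in diamonds.

```r
library(dplyr)

data(diamonds, package = "ggplot2")

cut_levels <- levels(diamonds$cut)

# Polynomial contrast matrix: one row per level, C - 1 columns.
poly <- contr.poly(length(cut_levels))
colnames(poly) <- paste0("cut_", seq_len(ncol(poly)))

# Lookup table (hypothetical name): level of `cut` plus its contrast row.
poly_lookup <- bind_cols(
  tibble(cut = factor(cut_levels, levels = cut_levels, ordered = TRUE)),
  as_tibble(poly)
)

# Join the contrast values onto the data and drop the original factor.
dummied <- diamonds |>
  select(carat, cut, price) |>
  left_join(poly_lookup, by = "cut") |>
  select(carat, price, starts_with("cut_"))

dummied
```

The same join works on d_training from the question. It only reproduces the values of the cut_* columns; none of the bookkeeping that recipes does between prep() and bake() is replicated here.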

You can take a look at the source code for step_dummy(); I'm not sure I would call it a black box per se. Notice that during bake(), it uses model.matrix() from base R.

library(rsample)
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(diamonds, package = "ggplot2")

my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>% 
  prep() %>%
  bake(new_data = NULL) 

d_training_dummied_using_recipe
#> # A tibble: 5,394 × 6
#>    carat price     cut_1  cut_2     cut_3  cut_4
#>    <dbl> <int>     <dbl>  <dbl>     <dbl>  <dbl>
#>  1  0.31   544 -3.16e- 1 -0.267  6.32e- 1 -0.478
#>  2  0.72  3294  6.32e- 1  0.535  3.16e- 1  0.120
#>  3  0.7   2257 -1.48e-18 -0.535 -3.89e-16  0.717
#>  4  0.5   1446  6.32e- 1  0.535  3.16e- 1  0.120
#>  5  0.31   772  6.32e- 1  0.535  3.16e- 1  0.120
#>  6  1.01  3733  3.16e- 1 -0.267 -6.32e- 1 -0.478
#>  7  0.31   942  6.32e- 1  0.535  3.16e- 1  0.120
#>  8  0.43   903 -3.16e- 1 -0.267  6.32e- 1 -0.478
#>  9  1.21  4391  3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 10  1.37  5370  3.16e- 1 -0.267 -6.32e- 1 -0.478
#> # … with 5,384 more rows


model.matrix(price ~ .,
             data = d_training) %>%
  as_tibble()
#> # A tibble: 5,394 × 6
#>    `(Intercept)` carat     cut.L  cut.Q     cut.C `cut^4`
#>            <dbl> <dbl>     <dbl>  <dbl>     <dbl>   <dbl>
#>  1             1  0.31 -3.16e- 1 -0.267  6.32e- 1  -0.478
#>  2             1  0.72  6.32e- 1  0.535  3.16e- 1   0.120
#>  3             1  0.7  -1.48e-18 -0.535 -3.89e-16   0.717
#>  4             1  0.5   6.32e- 1  0.535  3.16e- 1   0.120
#>  5             1  0.31  6.32e- 1  0.535  3.16e- 1   0.120
#>  6             1  1.01  3.16e- 1 -0.267 -6.32e- 1  -0.478
#>  7             1  0.31  6.32e- 1  0.535  3.16e- 1   0.120
#>  8             1  0.43 -3.16e- 1 -0.267  6.32e- 1  -0.478
#>  9             1  1.21  3.16e- 1 -0.267 -6.32e- 1  -0.478
#> 10             1  1.37  3.16e- 1 -0.267 -6.32e- 1  -0.478
#> # … with 5,384 more rows

Created on 2021-12-30 by the reprex package (v2.0.1)
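To get from the model.matrix() result to the same shape as the recipe output, a few dplyr verbs are enough. This is only a sketch: the object name mm is invented here, it uses the full diamonds data rather than a random split so that it is reproducible, and it assumes cut is the only predictor whose column names start with "cut".

```r
library(dplyr)

data(diamonds, package = "ggplot2")
d <- diamonds[, c("carat", "cut", "price")]

mm <- model.matrix(price ~ ., data = d) |>
  as_tibble() |>
  # The recipe output has no intercept column.
  select(-`(Intercept)`) |>
  # cut.L, cut.Q, cut.C, cut^4  ->  cut_1, cut_2, cut_3, cut_4
  rename_with(~ paste0("cut_", seq_along(.x)), .cols = starts_with("cut")) |>
  # model.matrix() drops the outcome, so put it back next to carat.
  mutate(price = d$price, .after = carat)

mm
```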

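The recipes implementation that creates these indicator variables adds some guards and conveniences for learning from training data and applying the result to new or test data, as well as more standardized naming, among other things. This may be a particularly confusing example because cut is an ordered factor.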
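For contrast, here is a rough sketch of the unordered case, where the "reference cell" parameterization quoted in the question applies and the dummy columns are plain 0/1 indicators. It drops the ordering from cut and builds the indicators with tidyr::pivot_wider(). The helper columns .row and .value and the object names are invented for this example, and the column names recipes itself produces may differ (for instance in how it treats the space in "Very Good").

```r
library(dplyr)
library(tidyr)

data(diamonds, package = "ggplot2")

d <- diamonds |>
  select(carat, cut, price) |>
  mutate(cut = factor(cut, ordered = FALSE)) # drop the ordering

# All levels except the first ("Fair"), which acts as the reference level.
kept_levels <- levels(d$cut)[-1]

dummied_unordered <- d |>
  # .row keeps rows distinct; .value is the 1 that marks the observed level.
  mutate(.row = row_number(), .value = 1) |>
  pivot_wider(
    names_from   = cut,
    values_from  = .value,
    values_fill  = 0,
    names_prefix = "cut_"
  ) |>
  # Keep C - 1 indicator columns, in level order.
  select(carat, price, all_of(paste0("cut_", kept_levels)))

dummied_unordered
```

Each row has at most one 1 across the cut_* columns, and rows where cut is "Fair" are all zeros, which is the same encoding model.matrix() produces for an unordered factor under the default treatment contrasts.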