How to translate `recipes::step_dummy()` to `dplyr`/`tidyr` code?
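I am trying to figure out how `step_dummy()` from the recipes package wrangles the data. Although there is a reference page for this function, I still can't work out how to do the same thing with the "regular" tidyverse tools I know. Here is some code based on the recipes and rsample packages. I would like to achieve the same data output using only dplyr/tidyr tools.
I chose the `diamonds` dataset from ggplot2 for this demo.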
library(rsample)
library(recipes)
data(diamonds, package = "ggplot2") # `diamonds` ships with ggplot2, which is not attached here
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split <- initial_split(my_diamonds, prop = .1)
d_training <- training(init_split)
d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL) # equivalent to `juice()`: returns the training data (`d_training`) after the recipe steps were applied to it
d_training_dummied_using_recipe
#> # A tibble: 5,394 x 6
#> carat price cut_1 cut_2 cut_3 cut_4
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 0.5 1678 -0.316 -0.267 6.32e- 1 -0.478
#> 2 0.7 2608 -0.316 -0.267 6.32e- 1 -0.478
#> 3 1.7 9996 0.316 -0.267 -6.32e- 1 -0.478
#> 4 0.73 1824 0.316 -0.267 -6.32e- 1 -0.478
#> 5 0.4 988 0.632 0.535 3.16e- 1 0.120
#> 6 1.04 4240 0.316 -0.267 -6.32e- 1 -0.478
#> 7 0.9 3950 0 -0.535 -4.10e-16 0.717
#> 8 0.4 1116 0 -0.535 -4.10e-16 0.717
#> 9 1.34 10070 0.632 0.535 3.16e- 1 0.120
#> 10 0.6 806 0.316 -0.267 -6.32e- 1 -0.478
#> # ... with 5,384 more rows
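My question is: given `d_training`, how can we get the same output as `d_training_dummied_using_recipe` using dplyr or tidyr (and possibly forcats) functions? I have seen similar posts, but they don't seem to fit this case.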
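EDIT
Clearly, `step_dummy()` operates only on the `cut` column, because we specified `all_nominal()`. Indeed, `cut` is the only nominal variable in `d_training`. I thought the `cut_*` columns corresponded to the levels of `cut`, but then I ran: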
levels(d_training$cut)
#> [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
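This shows 5 levels, whereas there are only 4 `cut_*` columns. So this is one obstacle to understanding what is going on.
Also, how are the values in the `cut_*` columns generated?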
EDIT 2
I came across the most relevant vignette, How are categorical predictors handled in recipes?, which discusses the topic directly:
A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.
The default approach is to create dummy variables using the “reference cell” parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns
Regarding the number of levels versus the number of `cut_*` columns, the vignette explicitly says:
Note that the column names do not reference a specific level of the [...] variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn’t make sense.
But ultimately there is no example of how to do the same thing with regular tools (outside the recipes context). So my original question remains unsolved.
This is only half of an answer, but it should help you understand how the `cut_*` columns are mapped. Try this link for a more detailed look: https://recipes.tidymodels.org/articles/Dummies.html
library(tidyverse)
library(recipes)
diamonds |>
  select(carat, cut, price) |>
  mutate(original = cut) |>
  (\(d) recipe(formula = price ~ ., data = d))() |>
  step_dummy(cut) |>
  prep() |>
  bake(new_data = NULL, original, starts_with("cut")) |>
  distinct()
#> # A tibble: 5 x 5
#> original cut_1 cut_2 cut_3 cut_4
#> <ord> <dbl> <dbl> <dbl> <dbl>
#> 1 Ideal 0.632 0.535 3.16e- 1 0.120
#> 2 Premium 0.316 -0.267 -6.32e- 1 -0.478
#> 3 Good -0.316 -0.267 6.32e- 1 -0.478
#> 4 Very Good 0 -0.535 -4.10e-16 0.717
#> 5 Fair -0.632 0.535 -3.16e- 1 0.120
Edit:
Here is a bit more detail:
contr.poly(levels(diamonds$cut))
#> .L .Q .C ^4
#> [1,] -0.6324555 0.5345225 -3.162278e-01 0.1195229
#> [2,] -0.3162278 -0.2672612 6.324555e-01 -0.4780914
#> [3,] 0.0000000 -0.5345225 -4.095972e-16 0.7171372
#> [4,] 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
#> [5,] 0.6324555 0.5345225 3.162278e-01 0.1195229
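The `cut_*` columns are the `contr.poly` values mapped onto the levels of `cut`. Notice that the `cut_*` values in the table above are identical to the rows of the `contr.poly` matrix: row 1 corresponds to Fair and row 5 to Ideal.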
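Since the encoding is just a lookup from each level of `cut` to one row of the `contr.poly` matrix, it can be reproduced with a plain join. The following is a minimal sketch, not the way recipes does it internally; the names `cut_lookup` and `dummied_by_hand` are made up for this illustration, and the `cut_1` ... `cut_4` column names are chosen only to mimic the recipes output. It uses the full `my_diamonds` subset, but the same code should apply to `d_training`.

```r
library(dplyr)

data(diamonds, package = "ggplot2")
my_diamonds <- diamonds[, c("carat", "cut", "price")]

# Lookup table: one row per level of `cut`, one column per contrast.
cut_levels <- levels(my_diamonds$cut)
contrast_matrix <- contr.poly(length(cut_levels))
colnames(contrast_matrix) <- paste0("cut_", seq_len(ncol(contrast_matrix)))

cut_lookup <- as_tibble(contrast_matrix) %>%
  # Same ordered-factor type and levels as the original column,
  # so the join key is compatible
  mutate(cut = factor(cut_levels, levels = cut_levels, ordered = TRUE))

# Join the contrast values onto the data, then drop the original column
dummied_by_hand <- my_diamonds %>%
  left_join(cut_lookup, by = "cut") %>%
  select(carat, price, starts_with("cut_"))

dummied_by_hand
```

The join keeps one row per input row because `cut_lookup` has exactly one row per level.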
You can look at the source code for `step_dummy()`; I'm not sure I would call it a black box per se. Notice that during `bake()`, it uses `model.matrix()` from base R.
library(rsample)
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
data(diamonds, package = "ggplot2")
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split <- initial_split(my_diamonds, prop = .1)
d_training <- training(init_split)
d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>%
  prep() %>%
  bake(new_data = NULL)
d_training_dummied_using_recipe
#> # A tibble: 5,394 × 6
#> carat price cut_1 cut_2 cut_3 cut_4
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 0.31 544 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 2 0.72 3294 6.32e- 1 0.535 3.16e- 1 0.120
#> 3 0.7 2257 -1.48e-18 -0.535 -3.89e-16 0.717
#> 4 0.5 1446 6.32e- 1 0.535 3.16e- 1 0.120
#> 5 0.31 772 6.32e- 1 0.535 3.16e- 1 0.120
#> 6 1.01 3733 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 7 0.31 942 6.32e- 1 0.535 3.16e- 1 0.120
#> 8 0.43 903 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 9 1.21 4391 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 10 1.37 5370 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> # … with 5,384 more rows
model.matrix(price ~ ., data = d_training) %>%
  as_tibble()
#> # A tibble: 5,394 × 6
#> `(Intercept)` carat cut.L cut.Q cut.C `cut^4`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.31 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 2 1 0.72 6.32e- 1 0.535 3.16e- 1 0.120
#> 3 1 0.7 -1.48e-18 -0.535 -3.89e-16 0.717
#> 4 1 0.5 6.32e- 1 0.535 3.16e- 1 0.120
#> 5 1 0.31 6.32e- 1 0.535 3.16e- 1 0.120
#> 6 1 1.01 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 7 1 0.31 6.32e- 1 0.535 3.16e- 1 0.120
#> 8 1 0.43 -3.16e- 1 -0.267 6.32e- 1 -0.478
#> 9 1 1.21 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 10 1 1.37 3.16e- 1 -0.267 -6.32e- 1 -0.478
#> # … with 5,384 more rows
Created on 2021-12-30 by the reprex package (v2.0.1)
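To get from the `model.matrix()` output to the column layout that recipes returns, one option is to drop the intercept, rename the contrast columns by position, and add the outcome back. This is only a sketch under the assumption that the data has no missing values (`model.matrix()` would otherwise drop rows); the object names `encoded` and `dummied_via_model_matrix` are invented for this example, and it runs on the full `my_diamonds` subset rather than on `d_training`.

```r
library(dplyr)

data(diamonds, package = "ggplot2")
my_diamonds <- diamonds[, c("carat", "cut", "price")]

# Encode the predictors with base R, as in the reprex above
encoded <- model.matrix(price ~ ., data = my_diamonds) %>%
  as_tibble() %>%
  select(-`(Intercept)`) %>%
  # Positional rename of cut.L, cut.Q, cut.C, cut^4 to cut_1 ... cut_4
  rename_with(~ paste0("cut_", seq_along(.x)), .cols = starts_with("cut"))

# model.matrix() only returns predictors, so add `price` back
dummied_via_model_matrix <- encoded %>%
  mutate(price = my_diamonds$price) %>%
  relocate(price, .after = carat)

dummied_via_model_matrix
```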
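The recipes implementation for creating these indicator variables adds some guards and conveniences: learning the encoding from training data and applying it to new or test data, more standard column naming, and so on. This is probably an especially confusing example because `cut` is an ordered factor.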
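For contrast, here is a sketch of what the more familiar 0/1 "reference cell" dummy variables look like when built with dplyr and tidyr. It assumes you are willing to treat `cut` as an unordered factor, which is a change from the original data and not what the recipe above produces. The helper columns `row_id` and `value` and the result name `indicator_version` are made up for this illustration, and the resulting column names (for example `cut_Very Good`, with a space) are whatever `pivot_wider()` generates, not necessarily the names recipes would choose.

```r
library(dplyr)
library(tidyr)

data(diamonds, package = "ggplot2")

indicator_version <- diamonds %>%
  select(carat, cut, price) %>%
  mutate(
    row_id = row_number(),               # keeps every input row distinct
    cut = factor(cut, ordered = FALSE),  # assumption: ignore the ordering
    value = 1
  ) %>%
  pivot_wider(
    names_from = cut,
    values_from = value,
    names_prefix = "cut_",
    values_fill = 0,
    names_sort = TRUE
  ) %>%
  # Drop the first level ("Fair"), which acts as the reference cell
  select(-row_id, -cut_Fair)

indicator_version
```

Each remaining `cut_*` column is 1 for rows at that level and 0 otherwise, and rows with `cut == "Fair"` are 0 in all four columns.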