tidymodels recipes: can I use step_dummy() to one-hot encode the categorical variables *except* booleans, which only need 1 dummy?
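If a categorical variable has more than 2 values (e.g. marital status = single/married/widowed/separated/divorced), then I need to create N dummies, one for each possible level. This is done using step_dummy(one_hot = TRUE).
However, if the categorical variable is binary (pokemon_fan = "yes"/"no"), then I only need to create 1 dummy called "pokemon_fan_yes". This is done using step_dummy(one_hot = FALSE).
Is it possible for step_dummy to count the number of levels and treat the variable differently based on that number?
Thanks.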
There isn't a way to do this automatically within recipes itself, but I think you can create a function to handle it for you, like this:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
data(crickets, package = "modeldata")
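# TRUE when the factor has more than `num` levels (default 2),
# i.e. when it should get full one-hot encoding
levels_more_than <- function(vec, num = 2) {
  n_distinct(levels(vec)) > num
}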
recipe(~ ., data = crickets) %>%
  step_dummy(species, one_hot = !! levels_more_than(crickets$species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 31 × 3
#> temp rate species_O..niveus
#> <dbl> <dbl> <dbl>
#> 1 20.8 67.9 0
#> 2 20.8 65.1 0
#> 3 24 77.3 0
#> 4 24 78.7 0
#> 5 24 79.4 0
#> 6 24 80.4 0
#> 7 26.2 85.8 0
#> 8 26.2 86.6 0
#> 9 26.2 87.5 0
#> 10 26.2 89.1 0
#> # … with 21 more rows
recipe(~ ., data = iris) %>%
  step_dummy(Species, one_hot = !! levels_more_than(iris$Species)) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 150 × 7
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species_setosa
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2 1
#> 2 4.9 3 1.4 0.2 1
#> 3 4.7 3.2 1.3 0.2 1
#> 4 4.6 3.1 1.5 0.2 1
#> 5 5 3.6 1.4 0.2 1
#> 6 5.4 3.9 1.7 0.4 1
#> 7 4.6 3.4 1.4 0.3 1
#> 8 5 3.4 1.5 0.2 1
#> 9 4.4 2.9 1.4 0.2 1
#> 10 4.9 3.1 1.5 0.1 1
#> # … with 140 more rows, and 2 more variables: Species_versicolor <dbl>,
#> # Species_virginica <dbl>
Created on 2022-02-23 by the reprex package (v2.0.1)
Here are some tips for using not-quite-standard selectors in recipes.
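If a data set mixes binary and multi-level factors, one way to extend the same idea is to count the levels up front and pass the two groups of column names to two separate step_dummy() calls. The following is only a sketch under assumptions: the `toy` data frame and the names `binary_cols` and `multi_cols` are made up for illustration, and the level counts come from the data used to define the recipe.

```r
library(recipes)

# Hypothetical toy data: one binary factor, one five-level factor, one numeric
toy <- data.frame(
  pokemon_fan = factor(c("no", "yes", "yes", "no", "yes")),
  marital = factor(c("single", "married", "widowed", "separated", "divorced")),
  age = c(23, 41, 67, 35, 52)
)

# Count levels for every factor column, then split the column names
is_fct <- vapply(toy, is.factor, logical(1))
n_lvls <- vapply(toy[is_fct], nlevels, integer(1))
binary_cols <- names(n_lvls)[n_lvls == 2]  # illustrative name
multi_cols  <- names(n_lvls)[n_lvls > 2]   # illustrative name

rec <- recipe(~ ., data = toy) %>%
  # binary factors: a single indicator column, reference level dropped
  step_dummy(all_of(binary_cols), one_hot = FALSE) %>%
  # factors with more than two levels: one indicator column per level
  step_dummy(all_of(multi_cols), one_hot = TRUE)

baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

As with the `!!` approach above, the split is decided when the recipe is written, not when it is prepped, so it will not adapt if the factor levels differ in later training data.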