step_num2factor() Usage -- Tidymodel (Recipe Package)

Question

OK, I have read the function reference for step_num2factor and, honestly, I have not figured out how to use it properly.

temp_names <- as.character(unique(sort(all_raw$MSSubClass)))

price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_num2factor(MSSubClass, levels = temp_names)


temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data

class(all_raw$MSSubClass)
#> [1] "numeric"
# (read_csv parsed this column as col_double())

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

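After applying the step, temp_data$MSSubClass in the output is all NA. The observations are stored as 20, 30, 40, ..., 190, and I want to convert them to names (or even keep the same numbers, but as an unordered factor).

If you know of any blog posts or working code that show more about how to use step_num2factor, I would be glad to see them too.

The full dataset is provided by Kaggle: kaggle data

Thanks in advance,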

Answer 1

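I don't think step_num2factor() is the best fit for this variable. Take another look at the help page and notice the transform argument, which modifies the numeric values before the levels are determined; you would need to supply one here. The step uses the (transformed) value as an integer position in levels, so codes such as 20 or 190 point past the end of your 16-element vector, and that is why everything comes back as NA. A simple transform would work fine if the data were all multiples of 10, but you have values like 75 and 85, so I don't think you want that. This recipe step works best for numeric/integer-ish variables that a simple function can easily convert to a set of consecutive integers.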
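To make the NA result concrete, here is a small base R sketch that mimics the index-style lookup as I understand it from the help page. It is an illustration only, not the package's actual code; x and lvls are hypothetical toy objects built from the codebook in the question.

```r
# Toy stand-in for MSSubClass, using codes from the codebook in the question
x <- c(20, 30, 45, 190)
lvls <- as.character(c(20, 30, 40, 45, 50, 60, 70, 75, 80, 85,
                       90, 120, 150, 160, 180, 190))

# Index-style lookup: asks for elements 20, 30, 45 and 190 of a
# 16-element vector, which do not exist, so every result is NA
by_index <- factor(lvls[as.integer(x)], levels = lvls)
all(is.na(by_index))
#> [1] TRUE

# Mapping each code to its position first gives the intended factor
pos <- match(x, as.numeric(lvls))
by_match <- factor(lvls[pos], levels = lvls)
as.character(by_match)
#> [1] "20"  "30"  "45"  "190"
```

A function like the match() call above is the kind of thing transform would have to do, which is more work than a plain coercion to factor.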

Instead, I think you should consider step_mutate() and a simple coercion to factor type:

library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step

train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   Id = col_double(),
#>   MSSubClass = col_double(),
#>   LotFrontage = col_double(),
#>   LotArea = col_double(),
#>   OverallQual = col_double(),
#>   OverallCond = col_double(),
#>   YearBuilt = col_double(),
#>   YearRemodAdd = col_double(),
#>   MasVnrArea = col_double(),
#>   BsmtFinSF1 = col_double(),
#>   BsmtFinSF2 = col_double(),
#>   BsmtUnfSF = col_double(),
#>   TotalBsmtSF = col_double(),
#>   `1stFlrSF` = col_double(),
#>   `2ndFlrSF` = col_double(),
#>   LowQualFinSF = col_double(),
#>   GrLivArea = col_double(),
#>   BsmtFullBath = col_double(),
#>   BsmtHalfBath = col_double(),
#>   FullBath = col_double()
#>   # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.

price_recipe <-
  recipe(SalePrice ~ ., data = train_raw) %>%
  step_mutate(MSSubClass = factor(MSSubClass))

juiced_price <- prep(price_recipe) %>%
  juice()

levels(juiced_price$MSSubClass)
#>  [1] "20"  "30"  "40"  "45"  "50"  "60"  "70"  "75"  "80"  "85"  "90"  "120"
#> [13] "160" "180" "190"

juiced_price %>%
  count(MSSubClass)
#> # A tibble: 15 x 2
#>    MSSubClass     n
#>    <fct>      <int>
#>  1 20           536
#>  2 30            69
#>  3 40             4
#>  4 45            12
#>  5 50           144
#>  6 60           299
#>  7 70            60
#>  8 75            16
#>  9 80            58
#> 10 85            20
#> 11 90            52
#> 12 120           87
#> 13 160           63
#> 14 180           10
#> 15 190           30

Created on 2020-05-03 by the reprex package (v0.3.0)

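It looks to me like this gets you the factor levels you want. If you would rather use the strings from the .txt file (such as "1-STORY 1945 & OLDER"), save them as a vector new_levels, in the same order as the numeric codes, and say factor(MSSubClass, levels = codes, labels = new_levels), where codes holds the numeric codes. Passing the strings as levels alone would not match the numeric values and would give NA again.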
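A minimal sketch of the descriptive-label variant, assuming a tiny made-up data frame in place of the Kaggle file. The objects toy, codes and new_levels are hypothetical and cover only two of the sixteen classes; in practice both vectors would list every code in the codebook, in matching order.

```r
library(recipes)

# Hypothetical lookup vectors: codes[i] is described by new_levels[i]
codes <- c(20, 30)
new_levels <- c("1-STORY 1946 & NEWER ALL STYLES", "1-STORY 1945 & OLDER")

# Made-up stand-in for train_raw
toy <- data.frame(SalePrice = c(100, 200, 150), MSSubClass = c(20, 30, 20))

label_recipe <-
  recipe(SalePrice ~ ., data = toy) %>%
  # levels matches the numeric codes, labels renames them
  step_mutate(MSSubClass = factor(MSSubClass, levels = codes, labels = new_levels))

juiced_toy <- prep(label_recipe) %>%
  juice()

levels(juiced_toy$MSSubClass)
```

Any code that is missing from codes would still become NA, so it is worth checking anyNA(juiced_toy$MSSubClass) after prepping.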
