step_num2factor() Usage -- Tidymodels (recipes Package)
Well, I have read the function reference for step_num2factor and, to be honest, I have not figured out how to use it properly.
temp_names <- as.character(unique(sort(all_raw$MSSubClass)))
price_recipe <-
recipe(SalePrice ~ . , data = train_raw) %>%
step_num2factor(MSSubClass, levels = temp_names)
temp_rec <- prep(price_recipe, training = train_raw, strings_as_factors = FALSE) # temporary recipe
temp_data <- bake(temp_rec, new_data = all_raw) # temporary data
class(all_raw$MSSubClass)
# > col_double()
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
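After the step is applied, the output column temp_data$MSSubClass is all NA.
The observations are stored as 20, 30, 40, ..., 190, and I would like to convert them to the names (or even to the same numbers, but as an unordered factor).
If you know of a blog post about step_num2factor usage, or some code that uses it, I would be glad to see that too.
The complete dataset is provided by Kaggle: kaggle data
Thanks in advance,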
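I don't think step_num2factor() is the best fit for this variable. Take another look at the help page: the step uses the integer value of the variable as a position in levels, which is why codes such as 20 or 190 fall outside a short levels vector and come back as NA. To make it work here you would have to supply a transform argument, which modifies the numeric values before the levels are determined. That would work fine if the data were all multiples of 10, but you have values like 75 and 85, so I don't think you want that. This recipe step is best suited to numeric/integer-ish variables that you can easily convert to a set of consecutive integers with a simple function.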
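To make the NA result concrete, here is a minimal base-R sketch of index-based lookup. It assumes, as described in the step's help page, that the (transformed) number is used as a position in levels. The helper num2factor_sketch() and the four-code sample are hypothetical illustrations, not the recipes source code.

```r
# Hypothetical helper that mimics index-based lookup; NOT the actual recipes internals.
num2factor_sketch <- function(x, levels, transform = function(x) x) {
  idx <- as.integer(transform(x))       # the number is treated as a position in `levels`
  factor(levels[idx], levels = levels)  # positions past the end of `levels` become NA
}

codes <- c(20, 30, 40, 45)              # a few MSSubClass codes, for illustration only
lvls  <- as.character(codes)
x     <- c(20, 45, 30)

# Without a transform, 20, 45 and 30 are out of range for a length-4 vector.
raw <- num2factor_sketch(x, lvls)

# With a transform that maps each code to its position, the lookup succeeds.
mapped <- num2factor_sketch(x, lvls, transform = function(x) match(x, codes))

print(raw)
print(mapped)
```

A transform like match() works, but at that point the step adds little over a direct coercion to factor.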
Instead, I think you should consider step_mutate() and a simple coercion to a factor type:
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
train_raw <- read_csv("~/Downloads/house-prices-advanced-regression-techniques/train.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_character(),
#> Id = col_double(),
#> MSSubClass = col_double(),
#> LotFrontage = col_double(),
#> LotArea = col_double(),
#> OverallQual = col_double(),
#> OverallCond = col_double(),
#> YearBuilt = col_double(),
#> YearRemodAdd = col_double(),
#> MasVnrArea = col_double(),
#> BsmtFinSF1 = col_double(),
#> BsmtFinSF2 = col_double(),
#> BsmtUnfSF = col_double(),
#> TotalBsmtSF = col_double(),
#> `1stFlrSF` = col_double(),
#> `2ndFlrSF` = col_double(),
#> LowQualFinSF = col_double(),
#> GrLivArea = col_double(),
#> BsmtFullBath = col_double(),
#> BsmtHalfBath = col_double(),
#> FullBath = col_double()
#> # ... with 18 more columns
#> )
#> See spec(...) for full column specifications.
price_recipe <-
recipe(SalePrice ~ ., data = train_raw) %>%
step_mutate(MSSubClass = factor(MSSubClass))
juiced_price <- prep(price_recipe) %>%
juice()
levels(juiced_price$MSSubClass)
#> [1] "20" "30" "40" "45" "50" "60" "70" "75" "80" "85" "90" "120"
#> [13] "160" "180" "190"
juiced_price %>%
count(MSSubClass)
#> # A tibble: 15 x 2
#> MSSubClass n
#> <fct> <int>
#> 1 20 536
#> 2 30 69
#> 3 40 4
#> 4 45 12
#> 5 50 144
#> 6 60 299
#> 7 70 60
#> 8 75 16
#> 9 80 58
#> 10 85 20
#> 11 90 52
#> 12 120 87
#> 13 160 63
#> 14 180 10
#> 15 190 30
Created on 2020-05-03 by the reprex package (v0.3.0)
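In my opinion, this gets you the factor levels you want. If you want to use the strings from the .txt file (such as "1-STORY 1945 & OLDER"), saved as a vector new_levels, you can supply them as the labels argument of factor(), keeping the numeric codes as the levels. Passing the strings as levels alone would not match any value in the data and would again give NA.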
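As a rough base-R sketch of attaching the descriptive names: factor() matches the data against levels and displays labels, so the numeric codes go in levels and the description strings in labels. The vectors codes, new_levels and x below are small hypothetical samples rather than the full data dictionary.

```r
# Hypothetical sample: three codes and their descriptions from the data dictionary.
codes <- c(20, 30, 40)
new_levels <- c(
  "1-STORY 1946 & NEWER ALL STYLES",
  "1-STORY 1945 & OLDER",
  "1-STORY W/FINISHED ATTIC ALL AGES"
)

x <- c(30, 20, 20, 40)  # stand-in for the MSSubClass column

# Codes are matched via `levels`; the strings are attached via `labels`.
named <- factor(x, levels = codes, labels = new_levels)

# Passing the strings as `levels` matches nothing in the data, so every value is NA.
unmatched <- factor(x, levels = new_levels)

print(named)
print(unmatched)
```

Inside a recipe, the same call could be placed within step_mutate(), provided codes and new_levels have the same length and order.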