Tidymodels + Spark
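I'm trying to develop a simple logistic regression model using Tidymodels with the Spark engine. My code works fine when I specify set_engine("glm"), but it fails when I try to set the engine to spark. Any advice would be much appreciated!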
library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train
train.df <- train.df %>%
mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
Sex = factor(Sex),
Pclass = factor(Pclass))
skimr::skim(train.df)
# Just working with Spark locally.
sc <- spark_connect(master = 'local', version = '3.1')
train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <-
recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
update_role(PassengerId, new_role = 'ID') %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_impute_linear(all_predictors())
logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <-
logistic_reg() %>%
set_mode("classification") %>%
set_engine("spark")
logistic.regression.model
logistic.regression.workflow <-
workflow() %>%
add_recipe(logistic.regression.recipe) %>%
add_model(logistic.regression.model)
logistic.regression.workflow
logistic.regression.final.model <-
logistic.regression.workflow %>%
fit(data = train.spark.df)
logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.
Thanks for reading!
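Support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or for putting those building blocks together in workflows. That is why fit() on the workflow rejects the tbl_spark. You can, for example, fit just the logistic regression model: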
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(sparklyr)
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#>
#> invoke
#> The following object is masked from 'package:stats':
#>
#> filter
sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)
log_spec <- logistic_reg() %>% set_engine("spark")
log_spec %>%
fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#>
#> Fit time: 5.1s
#> Formula: Survived ~ Sex + Fare + Pclass
#>
#> Coefficients:
#> (Intercept) Sex_male Fare Pclass
#> 3.143731639 -2.630648858 0.001450218 -0.917173436
Created on 2021-07-09 by the reprex package (v2.0.0)
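But you can't use the recipes and workflows methods out of the box. You might consider trying something like spark_apply(), but at the current maturity of the tidymodels integration with Spark, that may be a challenge.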
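If the training data fits in memory, one possible workaround is to run the recipe on an ordinary data frame and hand only the preprocessed result to Spark for the parsnip fit. The following is a minimal sketch under that assumption, not an officially supported pattern. The object names (train_local, titanic_rec, baked, baked_sp) are made up for illustration. The dummy column names in the formula (Sex_male, Pclass_X2, Pclass_X3) are what step_dummy() is expected to produce for these factors and should be checked with names(baked). The imputation step is also narrowed to Age, the only predictor here with missing values.

```r
library(tidymodels)
library(sparklyr)

# Work on a local data frame first; recipes accepts this without complaint.
train_local <- titanic::titanic_train %>%
  mutate(
    Sex    = factor(Sex),
    Pclass = factor(Pclass)
  )

# Impute Age from the other predictors, then create dummy variables.
titanic_rec <- recipe(Survived ~ Sex + Age + Pclass, data = train_local) %>%
  step_impute_linear(Age, impute_with = imp_vars(Sex, Pclass)) %>%
  step_dummy(all_nominal(), -all_outcomes())

# Estimate the recipe and return the preprocessed training set.
baked <- titanic_rec %>%
  prep() %>%
  bake(new_data = NULL)

# Copy the preprocessed data to Spark and fit with parsnip alone.
sc <- spark_connect(master = "local")
baked_sp <- copy_to(sc, baked, overwrite = TRUE)

logistic_reg() %>%
  set_engine("spark") %>%
  fit(Survived ~ Age + Sex_male + Pclass_X2 + Pclass_X3, data = baked_sp)
```

This keeps the feature engineering in recipes and the model fit in Spark, at the cost of doing the preprocessing on a single machine, so it does not help when the data is too large to collect locally.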