Tidymodels + Spark

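I'm trying to develop a simple logistic regression model using Tidymodels with the Spark engine. My code works fine when I set the engine to "glm", but it fails when I try to set the engine to "spark". Any advice would be much appreciated!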

library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train

train.df <- train.df %>% 
  mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
         Sex = factor(Sex),
         Pclass = factor(Pclass))

skimr::skim(train.df)
# Just working with Spark locally.

sc <- spark_connect(master = 'local', version = '3.1')

train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <- 
  recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
  update_role(PassengerId, new_role = 'ID') %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_impute_linear(all_predictors())

logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <- 
  logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("spark")

logistic.regression.model
logistic.regression.workflow <- 
  workflow() %>% 
  add_recipe(logistic.regression.recipe) %>% 
  add_model(logistic.regression.model)

logistic.regression.workflow
logistic.regression.final.model <- 
  logistic.regression.workflow %>% 
  fit(data = train.spark.df)

logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.

Thanks for reading!

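The support for Spark in tidymodels is not even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes or for putting those building blocks together in workflows. So, for example, you can fit just the logistic regression model: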

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(sparklyr)
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#> 
#>     invoke
#> The following object is masked from 'package:stats':
#> 
#>     filter

sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)


log_spec <- logistic_reg() %>% set_engine("spark")

log_spec %>%
  fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#> 
#> Fit time:  5.1s 
#> Formula: Survived ~ Sex + Fare + Pclass
#> 
#> Coefficients:
#>  (Intercept)     Sex_male         Fare       Pclass 
#>  3.143731639 -2.630648858  0.001450218 -0.917173436

Created on 2021-07-09 by the reprex package (v2.0.0)

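But you can't use recipes and workflows with Spark out of the box, which is why fit() on the workflow rejects the tbl_spark. You might consider trying something like spark_apply(), but at tidymodels' current stage of maturity in integrating with Spark, that will likely be a challenge.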
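One possible workaround, offered as a rough sketch rather than a supported tidymodels route, is to do the feature engineering inside Spark with sparklyr's own feature transformers and then pass the prepared table to parsnip. The example below assumes a local Spark installation. It replaces the linear imputation from the question with a plain mean imputation via ft_imputer(), and it treats Pclass as numeric, as in the fit above. The table name "titanic_train" and the column name Age_imputed are illustrative choices only.

```r
library(sparklyr)
library(dplyr)
library(parsnip)

# Assumes Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train,
                    name = "titanic_train", overwrite = TRUE)

# Feature engineering happens in Spark, not in a recipe:
# fill missing Age values with the column mean (a stand-in for
# step_impute_linear(), which is not available for Spark tables).
train_prepped <- train_sp %>%
  select(Survived, Sex, Age, Pclass) %>%
  ft_imputer(input_cols = "Age",
             output_cols = "Age_imputed",  # illustrative column name
             strategy = "mean") %>%
  select(-Age)

# Fit the parsnip model directly on the Spark table, without a workflow.
spark_fit <- logistic_reg() %>%
  set_engine("spark") %>%
  fit(Survived ~ Sex + Age_imputed + Pclass, data = train_prepped)

spark_fit

spark_disconnect(sc)
```

This keeps all the data in Spark, at the cost of giving up the recipe and workflow objects. The preprocessing steps have to be reapplied by hand to any new data before predicting.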