sparklyr feature transformation functions result in error

I'm having some trouble using the ft_.. functions in the sparklyr R package. ft_binarizer works, but ft_normalizer and ft_min_max_scaler do not. Here is an example:

library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local", version = "2.1.0")
x <- flights %>% select(dep_delay)
x_tbl <- sdf_copy_to(sc, x)

# works fine
ft_binarizer(x = x_tbl, input.col = "dep_delay", output.col = "delayed", threshold = 0)

# error
ft_normalizer(x = x_tbl, input.col = "dep_delay", output.col = "delayed_norm")

# error
ft_min_max_scaler(x = x_tbl, input.col = "dep_delay", output.col = "delayed_min_max")

ft_normalizer returns:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc: (double) => vector)

ft_min_max_scaler returns:

"Error: java.lang.IllegalArgumentException: requirement failed: Column dep_delay must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually DoubleType."

I think it is a data type problem, but I don't know how to solve it. Does anyone know what to do?

Thanks a lot!

ft_normalizer operates on Vector columns, while dep_delay is a plain double column, so you have to use ft_vector_assembler first:

ft_vector_assembler(x_tbl, input_cols="dep_delay", output_col="dep_delay_v") %>% 
  ft_normalizer(input.col = "dep_delay_v", output.col = "delayed_v_norm")
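The ft_min_max_scaler error in the question asks for the same VectorUDT type, so the same assemble-first approach should apply there too. Below is a minimal sketch, not a tested recipe. It assumes a sparklyr version that accepts the underscore argument names used for ft_vector_assembler above (older releases may need input.col / output.col instead). The table name "x" and the output column names are made up for illustration. Dropping rows with a missing dep_delay before copying is my own precaution, not something the original code does.

```r
library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local", version = "2.1.0")

# Assumption: drop missing delays locally so the scaler only sees real values.
x <- flights %>%
  select(dep_delay) %>%
  filter(!is.na(dep_delay))
x_tbl <- sdf_copy_to(sc, x, name = "x", overwrite = TRUE)

# Wrap the double column in a one-element vector column, then rescale that
# vector column (the scaler's default target range is [0, 1]).
scaled <- x_tbl %>%
  ft_vector_assembler(input_cols = "dep_delay", output_col = "dep_delay_v") %>%
  ft_min_max_scaler(input_col = "dep_delay_v", output_col = "dep_delay_scaled")
```

Note that Spark's Normalizer rescales each row vector to unit norm, so on a one-element vector every non-zero value becomes 1 or -1. If the goal is to rescale the column as a whole, ft_min_max_scaler is probably the closer fit.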