Spark 2.1 cannot write Vector field on CSV
I am running into a problem saving a DataFrame while migrating code from Spark 2.0 to 2.1.
Here is the code:
import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.VectorUDT
val df = spark.createDataFrame(Seq(Tuple1(1))).toDF("values")
val toSave = new org.apache.spark.ml.feature.VectorAssembler().setInputCols(Array("values")).transform(df)
toSave.write.csv(path)
This code succeeds with Spark 2.0.0.
With Spark 2.1.0.cloudera1, it fails with the following error:
java.lang.UnsupportedOperationException: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type.
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.org$apache$spark$sql$execution$datasources$csv$CSVFileFormat$$verifyType(CSVFileFormat.scala:233)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$verifySchema.apply(CSVFileFormat.scala:237)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$verifySchema.apply(CSVFileFormat.scala:237)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:96)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.verifySchema(CSVFileFormat.scala:237)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.prepareWrite(CSVFileFormat.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:484)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:520)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:579)
... 50 elided
Is this only on my side?
Is this related to the Cloudera build of Spark 2.1? (Judging from their repo, they do not seem to have touched spark.sql, so maybe not.)
Thanks!
The answer below is based on @zero323's comments.
The CSV source does not support complex objects. Exactly as in your case, the exception "java.lang.UnsupportedOperationException: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type." is the expected behavior. It does not work with Spark 2.x, although it used to work in 1.x with spark-csv, where the vector was converted to a string.
This behavior is considered correct, per the following jira: SPARK-16216.
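Since the 1.x behavior mentioned above was essentially just turning the vector into a string, the same effect can be reproduced by hand before writing. A minimal sketch (not part of @zero323's comment; it assumes the assembled column was given an explicit name with setOutputCol("features"), which the question's code does not do):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Render the ML Vector as its string form, e.g. "[1.0]" for a dense
// vector or "(3,[0],[1.0])" for a sparse one.
val vecToString = udf((v: Vector) => v.toString)

// "features" is a hypothetical column name: the VectorAssembler in the
// question would need .setOutputCol("features") for this to line up.
val writable = toSave.withColumn("features", vecToString(toSave("features")))
writable.write.csv(path)

The column is then a plain string, which the CSV writer accepts.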
Alternatively, as a workaround, you can use the VectorDisassembler class from this fork, or take the solution described there.
I used VectorDisassembler to store the resulting DataFrame of the ml.feature.StandardScaler.fit method into a CSV file.
The code goes roughly like this:
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
// split the "scaledFeatures" Vector column into one primitive column per element
val disassembledDF = disassembler.setInputCol("scaledFeatures").transform(df)
disassembledDF.show()
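After disassembling, the frame contains only primitive columns, so the standard CSV writer should accept it; the final step (not shown in the original snippet) would simply be:

// the disassembled columns are all primitive types, which CSV supports
disassembledDF.write.csv(path)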