Spark DataFrame CountVectorizerModel Error With DataType String
I have the following code that attempts a simple operation: converting a sparse vector to a dense vector. This is what I have so far:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import org.apache.spark.ml.feature.CountVectorizerModel
// spark.ml transformers emit ml.linalg vectors, not mllib.linalg ones
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._
// Identify how many distinct values are in the OCEAN_PROXIMITY column
val distinctOceanProximities = dfRaw.select(col("ocean_proximity")).distinct().as[String].collect()
// The snippet originally passed an undefined `tags`; the distinct values are assumed to be the intended vocabulary
val cvmDF = new CountVectorizerModel(distinctOceanProximities)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(dfRaw) // fails here: ocean_proximity is a string, not array<string>
// One output column per vocabulary entry, named after that entry
val exprs = (0 until distinctOceanProximities.size).map(i => $"features".getItem(i).alias(distinctOceanProximities(i)))
val vecToSeq = udf((v: Vector) => v.toArray)
val df2 = cvmDF.withColumn("features", vecToSeq($"sparseFeatures")).select(exprs:_*)
df2.show()
When I run this script, I get the following error:
java.lang.IllegalArgumentException: requirement failed: Column ocean_proximity must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type string.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:63)
at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema(CountVectorizer.scala:97)
at org.apache.spark.ml.feature.CountVectorizerParams.validateAndTransformSchema$(CountVectorizer.scala:95)
at org.apache.spark.ml.feature.CountVectorizerModel.validateAndTransformSchema(CountVectorizer.scala:272)
at org.apache.spark.ml.feature.CountVectorizerModel.transformSchema(CountVectorizer.scala:338)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.feature.CountVectorizerModel.transform(CountVectorizer.scala:306)
... 101 elided
I think the column needs to be a sequence of strings, but I only have a single string. Is there a workaround?
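It turned out to be simple. All I had to do was convert the column from a string to an array of strings (`array` comes from `org.apache.spark.sql.functions`), like this: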
val oceanProximityAsArrayDF = dfRaw.withColumn("ocean_proximity", array("ocean_proximity"))
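For completeness, here is a minimal sketch of how that fix might slot into the rest of the script. The tiny `dfRaw` built here is a hypothetical stand-in for the real data, the local `SparkSession` is only there to make the snippet self-contained, and using the sorted distinct values as the vocabulary is my assumption about what `tags` was meant to be.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.Vector

// Local session so the sketch runs on its own; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("cvm-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the real dfRaw: a single string column
val dfRaw = Seq("NEAR BAY", "INLAND", "NEAR BAY", "ISLAND").toDF("ocean_proximity")

// Assumed vocabulary: the distinct category values, sorted for a stable column order
val distinctOceanProximities =
  dfRaw.select(col("ocean_proximity")).distinct().as[String].collect().sorted

// The fix: wrap each string in a one-element array so the column is array<string>
val oceanProximityAsArrayDF =
  dfRaw.withColumn("ocean_proximity", array(col("ocean_proximity")))

val cvmDF = new CountVectorizerModel(distinctOceanProximities)
  .setInputCol("ocean_proximity")
  .setOutputCol("sparseFeatures")
  .transform(oceanProximityAsArrayDF)

// Convert the (sparse) ml vector into a plain array of doubles
val vecToSeq = udf((v: Vector) => v.toArray)

// One output column per vocabulary entry, named after that entry
val exprs = distinctOceanProximities.indices.map { i =>
  col("features").getItem(i).alias(distinctOceanProximities(i))
}

val df2 = cvmDF
  .withColumn("features", vecToSeq(col("sparseFeatures")))
  .select(exprs: _*)

df2.show()
```

Each row should then carry a count of 1.0 in the column matching its category and 0.0 elsewhere. On Spark 3.0 or later, `org.apache.spark.ml.functions.vector_to_array` could probably replace the hand-written UDF, though I have not tried it in this script.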