Vector assembler in Pyspark is creating tuple of multiple vectors instead of a single vector, how to solve the issue?
My Python version is 3.6.3 and my Spark version is 2.2.1. Here is my code:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession.builder.appName("Data Preprocessor") \
    .config("spark.some.config.option", "1") \
    .getOrCreate()

dataset = spark.createDataFrame(
    [(0, 59.0, 0.0,
      Vectors.dense([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 9.0, 9.0]),
      1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(inputCols=["hour", "mobile", "userFeatures"],
                            outputCol="features")

output = assembler.transform(dataset)
output.select("features").show(truncate=False)
Instead of getting a single vector, I am getting the following output:
(12,[0,2,9,10,11],[59.0,2.0,9.0,9.0,9.0])
The vector returned by VectorAssembler is in SparseVector form, which is still a single vector, not a tuple: VectorAssembler simply stores the result in whichever representation (dense or sparse) is more compact. In the output above, 12 is the total number of features, [0,2,9,10,11] are the indices of the non-zero entries, and [59.0,2.0,9.0,9.0,9.0] are the non-zero values.
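As a quick check (a minimal sketch run against the output DataFrame built above), you can confirm that the features column holds a single SparseVector and print it in dense form:

row = output.select("features").first()
v = row["features"]
print(type(v))      # <class 'pyspark.ml.linalg.SparseVector'>
print(v.toArray())  # dense view: [59.  0.  2.  0.  0.  0.  0.  0.  0.  9.  9.  9.]

If you want the column itself stored densely, one option (a sketch using a plain Python UDF, which is slower than keeping the sparse form) is:

from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

# Convert each SparseVector to a DenseVector; VectorUDT is the Spark SQL type for ML vectors.
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
output.withColumn("features", to_dense("features")).show(truncate=False)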