What data type does VectorAssembler require for an input?

Question

Here is the core of the problem:

from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()

This fails with the error IllegalArgumentException: Data type array<bigint> of column a is not supported.

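I know this is a bit of a toy problem, but I'm trying to integrate it into a longer pipeline with the following steps:

StringIndexer
OneHotEncoding
A custom UnaryTransformer that multiplies all 1s by 10
- What data type should this step return?
Then a VectorAssembler that combines the vectors into a single vector for modeling.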

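If I can determine the correct input data type for VectorAssembler, I should be able to string everything together properly. I think the input type is Vector, but I can't figure out how to build one.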

Answer 1

According to the docs,

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

So you need to convert the array column to a vector column first:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
df_with_vectors = df.withColumn('a', list_to_vector_udf('a'))

Then you can use the VectorAssembler:

vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])

vecAssembler.transform(df_with_vectors).show(truncate=False)
+-------------+---+---+---------------------+
|a            |b  |c  |features             |
+-------------+---+---+---------------------+
|[1.0,2.0,3.0]|0  |3  |[1.0,2.0,3.0,0.0,3.0]|
+-------------+---+---+---------------------+

Tags: python, apache-spark, pyspark, apache-spark-ml