What data type does VectorAssembler require for an input?

Here is the core of the problem:

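from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
# Reuses the active session if one exists (e.g. in the pyspark shell).
spark = SparkSession.builder.getOrCreate()
# Column "a" holds a Python list, which Spark infers as array<bigint>.
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()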

Error: IllegalArgumentException: Data type array<bigint> of column a is not supported.

I know this is a bit of a toy problem, but I'm trying to integrate this step into a longer pipeline.

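If I can pin down the correct input data type for VectorAssembler, I should be able to string everything together properly. I think the input type is Vector, but I don't know how to build one.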

According to the docs,

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

So you need to convert the array column to a vector column first (the method is borrowed from a linked post).

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
df_with_vectors = df.withColumn('a', list_to_vector_udf('a'))

Then you can use VectorAssembler:

vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])

vecAssembler.transform(df_with_vectors).show(truncate=False)
+-------------+---+---+---------------------+
|a            |b  |c  |features             |
+-------------+---+---+---------------------+
|[1.0,2.0,3.0]|0  |3  |[1.0,2.0,3.0,0.0,3.0]|
+-------------+---+---+---------------------+
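
To make the type rule quoted from the docs concrete, here is a minimal plain-Python sketch of the flattening that the output above reflects. It is not Spark's implementation: `assemble` is a hypothetical helper, a tuple stands in for a vector value, and a list stands in for an array column value, which is rejected just as `array<bigint>` was.

```python
def assemble(*values):
    """Flatten scalar and vector-like inputs into one list of floats.

    Hypothetical stand-in for the documented rule, not Spark code:
    tuple = vector value (accepted), list = array value (rejected).
    """
    features = []
    for value in values:
        if isinstance(value, bool):
            # Booleans are accepted and become 0.0 or 1.0.
            features.append(1.0 if value else 0.0)
        elif isinstance(value, (int, float)):
            # Any numeric scalar contributes a single feature.
            features.append(float(value))
        elif isinstance(value, tuple):
            # A "vector" contributes all of its elements, in order.
            features.extend(float(x) for x in value)
        else:
            # Anything else, including a plain list, is unsupported.
            raise TypeError(
                f"Data type {type(value).__name__} is not supported"
            )
    return features


print(assemble((1.0, 2.0, 3.0), 0, 3))  # [1.0, 2.0, 3.0, 0.0, 3.0]
```

Calling `assemble([1, 2, 3], 0, 3)` raises `TypeError` in this sketch, which mirrors why the array column has to be converted before it is assembled.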