
Pyspark - add another column to a sparse vector column

I have a PySpark DataFrame in which one column (features) is a sparse vector. For example:

+------------------+-----+
|     features     |label|
+------------------+-----+
| (4823,[87],[0.0])|  0.0|
| (4823,[31],[2.0])|  0.0|
|(4823,[159],[0.0])|  1.0|
|  (4823,[1],[7.0])|  0.0|
|(4823,[15],[27.0])|  0.0|
+------------------+-----+

I want to extend the features column and add another feature to it, for example:

+---------------------------+-----+
|          features         |label|
+---------------------------+-----+
|          (4824,[87],[0.0])|  0.0|
|          (4824,[31],[2.0])|  0.0|
|         (4824,[159],[0.0])|  1.0|
|           (4824,[1],[7.0])|  0.0|
|(4824,[15,4823],[27.0,7.0])|  0.0|
+---------------------------+-----+

Is there a way to do this without unpacking the SparseVector into a dense vector and then repacking it as sparse with the new column?

The easiest way to add a new column to an existing SparseVector is with the VectorAssembler transformer from the ML library. It automatically combines the input columns into a single vector column (DenseVector or SparseVector, whichever uses less memory), and it does not convert a SparseVector into a DenseVector during the merge (see the source code). It can be used as follows:

from pyspark.ml.feature import VectorAssembler

df = ...  # DataFrame with a "features" vector column and a numeric "new_col"

# The output column must not share a name with an input column,
# so the result is written to "features_vec" here.
assembler = VectorAssembler(
    inputCols=["features", "new_col"],
    outputCol="features_vec")

output = assembler.transform(df)
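The sparse merge the assembler performs can be illustrated with a plain-Python sketch (this is not Spark's actual implementation; a `(size, indices, values)` tuple stands in for a SparseVector). Appending a scalar column only extends the index space, so the data is never densified:

```python
def merge_sparse_with_scalar(size, indices, values, scalar):
    """Append one scalar feature to a sparse vector, represented here as a
    (size, indices, values) tuple, without densifying it.

    The scalar lands at the new last index (= old size); a zero scalar is
    dropped, matching sparse semantics."""
    new_indices = list(indices)
    new_values = list(values)
    if scalar != 0.0:
        new_indices.append(size)  # new feature occupies the next slot
        new_values.append(scalar)
    return size + 1, new_indices, new_values

# Mirrors the last row of the tables above:
print(merge_sparse_with_scalar(4823, [15], [27.0], 7.0))
# → (4824, [15, 4823], [27.0, 7.0])
```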

To simply increase the size of a SparseVector without adding any new values, just create a new vector with a larger size:

from pyspark.sql.functions import col, udf
from pyspark.ml.linalg import SparseVector, VectorUDT

def add_empty_col_(v):
    # Same indices and values, but one extra (implicitly zero) slot.
    return SparseVector(v.size + 1, v.indices, v.values)

add_empty_col = udf(add_empty_col_, VectorUDT())
df = df.withColumn("sparse", add_empty_col(col("features")))
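The effect of add_empty_col_ can be checked with a plain-Python stand-in (a sketch only; `(size, indices, values)` tuples model the SparseVector, and `add_empty_slot`/`to_dense` are hypothetical helpers, not Spark API):

```python
def add_empty_slot(v):
    """Mirror of add_empty_col_ above for a (size, indices, values) tuple:
    the vector grows by one slot whose value is implicitly 0.0."""
    size, indices, values = v
    return size + 1, list(indices), list(values)

def to_dense(v):
    """Expand a (size, indices, values) tuple to a dense list (for checking only)."""
    size, indices, values = v
    out = [0.0] * size
    for i, x in zip(indices, values):
        out[i] = x
    return out

v = (4, [1], [7.0])
w = add_empty_slot(v)
print(to_dense(w))  # → [0.0, 7.0, 0.0, 0.0, 0.0] — the new trailing slot is zero
```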