Pyspark - add another column to a sparse vector column
I have a PySpark dataframe in which one column (features) is a sparse vector. For example:
+------------------+-----+
| features |label|
+------------------+-----+
| (4823,[87],[0.0])| 0.0|
| (4823,[31],[2.0])| 0.0|
|(4823,[159],[0.0])| 1.0|
| (4823,[1],[7.0])| 0.0|
|(4823,[15],[27.0])| 0.0|
+------------------+-----+
I would like to expand the features column and add another feature to it, so that it looks like:
+-------------------+-----+
| features |label|
+-------------------+-----+
| (4824,[87],[0.0]) | 0.0|
| (4824,[31],[2.0]) | 0.0|
|(4824,[159],[0.0]) | 1.0|
| (4824,[1],[7.0]) | 0.0|
|(4824,[4824],[7.0])| 0.0|
+-------------------+-----+
Is there a way to do this without unpacking the SparseVector into a dense vector and then repacking it as a sparse vector with the new column?
The simplest way to add a new column to an existing SparseVector is to use the VectorAssembler transformer from the ML library. It automatically combines the input columns into a single vector (a DenseVector or a SparseVector, depending on which uses the least memory), and it does not convert the vector into a DenseVector during the merging process (see the source code). It can be used as follows:
from pyspark.ml.feature import VectorAssembler

df = ...  # DataFrame with the existing "features" vector column and a numeric "new_col"
assembler = VectorAssembler(
    inputCols=["features", "new_col"],
    outputCol="features_assembled")  # output column must not already exist in df
output = assembler.transform(df)
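For completeness, here is a minimal, self-contained sketch of this approach (the small vector size, the lit(0.0) value for new_col, and the features_ext output name are illustrative choices, not taken from the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import SparseVector

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question, with a much smaller vector size.
df = spark.createDataFrame(
    [(SparseVector(5, [1], [2.0]), 0.0),
     (SparseVector(5, [3], [7.0]), 1.0)],
    ["features", "label"])

# Add the extra feature as an ordinary numeric column, then assemble.
df = df.withColumn("new_col", lit(0.0))
assembler = VectorAssembler(inputCols=["features", "new_col"], outputCol="features_ext")
assembler.transform(df).select("features_ext", "label").show(truncate=False)
# features_ext has size 6 and remains a SparseVector, since most entries are zero.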
To simply increase the size of the SparseVector without adding any new values, just create a new vector with a larger size:
from pyspark.sql.functions import udf, col
from pyspark.ml.linalg import SparseVector, VectorUDT

def add_empty_col_(v):
    return SparseVector(v.size + 1, v.indices, v.values)

add_empty_col = udf(add_empty_col_, VectorUDT())
df.withColumn("sparse", add_empty_col(col("features")))
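If the new feature should also carry a value at the new last position (as in the desired output above), a similar UDF can append the index and value explicitly. This variant (the add_value_col_ helper and the new_col / features_ext column names) is only a sketch, assuming the extra value lives in a numeric column; it is not part of the original answer:

from pyspark.sql.functions import udf, col
from pyspark.ml.linalg import SparseVector, VectorUDT

def add_value_col_(v, x):
    # Grow the vector by one slot and store x at the new last index (v.size).
    indices, values = list(v.indices), list(v.values)
    if x is not None and x != 0.0:  # SparseVector stores only non-zero entries
        indices.append(v.size)
        values.append(float(x))
    return SparseVector(v.size + 1, indices, values)

add_value_col = udf(add_value_col_, VectorUDT())
df.withColumn("features_ext", add_value_col(col("features"), col("new_col")))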