PySpark insert a constant SparseVector in a Dataframe column
I would like to insert a column named "ref" into my dataframe tfIdfFr, holding a constant of type pyspark.ml.linalg.SparseVector.
When I try this:
ref = tfidfTest.select("features").collect()[0].features # the reference
tfIdfFr.withColumn("ref", ref).select("ref", "features").show()
I get this error: AssertionError: col should be Column
And when I try this:
from pyspark.sql.functions import lit
tfIdfFr.withColumn("ref", lit(ref)).select("ref", "features").show()
I get that error: AttributeError: 'SparseVector' object has no attribute '_get_object_id'
Do you know a solution to insert a constant SparseVector into a Dataframe column?
In this case, I would skip collecting altogether:
from pyspark.sql.functions import col

ref = tfidfTest.select(col("features").alias("ref")).limit(1)
tfIdfFr.crossJoin(ref)
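Since ref is a single-row DataFrame, it may also help to hint the planner to broadcast it. A minimal sketch, reusing ref from above (broadcast is the standard pyspark.sql.functions hint; not part of the original answer):

from pyspark.sql.functions import broadcast

# Broadcasting the single-row side keeps the cross join cheap: every row
# of tfIdfFr is paired with the one reference row without a shuffle.
tfIdfFr.crossJoin(broadcast(ref)).select("ref", "features").show()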
In general, you can use a udf:
from pyspark.ml.linalg import DenseVector, SparseVector, Vector, Vectors, \
    VectorUDT
from pyspark.sql.functions import udf

def vector_lit(v):
    assert isinstance(v, Vector)
    # A zero-argument udf called immediately yields a constant Column
    # carrying the vector, with the proper VectorUDT type.
    return udf(lambda: v, VectorUDT())()
Usage:
spark.range(1).select(
    vector_lit(Vectors.sparse(5, [1, 3], [-1, 1])).alias("ref")
).show()
+--------------------+
| ref|
+--------------------+
|(5,[1,3],[-1.0,1.0])|
+--------------------+
spark.range(1).select(vector_lit(Vectors.dense([1, 2, 3])).alias("ref")).show()
+-------------+
| ref|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
It is also possible to use an intermediate representation:
import json

from pyspark.sql.functions import from_json, lit
from pyspark.sql.types import StructType, StructField

def as_column(v):
    assert isinstance(v, Vector)
    # VectorUDT serializes vectors as a struct of
    # (type, size, indices, values), where type 1 marks a dense
    # vector and type 0 a sparse one.
    if isinstance(v, DenseVector):
        j = lit(json.dumps({"v": {
            "type": 1,
            "values": v.values.tolist()
        }}))
    else:
        j = lit(json.dumps({"v": {
            "type": 0,
            "size": v.size,
            "indices": v.indices.tolist(),
            "values": v.values.tolist()
        }}))
    return from_json(j, StructType([StructField("v", VectorUDT())]))["v"]
Usage:
spark.range(1).select(
    as_column(Vectors.sparse(5, [1, 3], [-1, 1])).alias("ref")
).show()
+--------------------+
| ref|
+--------------------+
|(5,[1,3],[-1.0,1.0])|
+--------------------+
spark.range(1).select(as_column(Vectors.dense([1, 2, 3])).alias("ref")).show()
+-------------+
| ref|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
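Tying this back to the question, a minimal sketch of vector_lit applied to the asker's frames (assuming tfidfTest and tfIdfFr exist as described, with vector_lit defined as above):

# The collected value is a plain SparseVector; vector_lit wraps it in a
# constant Column of VectorUDT type, which withColumn accepts.
ref = tfidfTest.select("features").collect()[0].features
tfIdfFr.withColumn("ref", vector_lit(ref)).select("ref", "features").show()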