Python Spark DataFrame:用 SparseVector 替换 null

Python Spark DataFrame: replace null with SparseVector

在 spark 中,我有以下名为 "df" 的数据框,其中包含一些空条目:

+-------+--------------------+--------------------+                     
|     id|           features1|           features2|
+-------+--------------------+--------------------+
|    185|(5,[0,1,4],[0.1,0...|                null|
|    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
|    225|                null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+

df.features1 和 df.features2 是类型向量(可为空)。然后我尝试使用以下代码用 SparseVectors 填充空条目:

df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})

此代码导致以下错误:

AttributeError: 'SparseVector' object has no attribute '_get_object_id'

然后我在 spark 文档中找到了以下段落:

fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

Parameters: 
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.

这是否解释了我未能在 DataFrame 中用 SparseVectors 替换空条目?或者这是否意味着在 DataFrame 中无法执行此操作?

我可以通过将 DataFrame 转换为 RDD 并将 None 值替换为 SparseVectors 来实现我的目标,但直接在 DataFrame 中执行此操作对我来说会方便得多。

有没有什么方法可以直接在DataFrame中做到这一点? 谢谢!

您可以使用 udf:

from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import *

fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)

df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])

(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())

# +-------------+---------------+
# |    features1|      features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# |    (5,[],[])|     (10,[],[])|
# +-------------+---------------+