Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

Question

Given my pyspark Row object:

>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test.

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False

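This strange error is causing me a lot of trouble. Unless isinstance(row.features, Vector) passes, I cannot generate LabeledPoint objects with a map function. I would be really grateful if anyone could solve this problem.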

Answer 1

This is unlikely to be a bug. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and are comparing the wrong entities.

Let's illustrate this with an example. Some simple data:

from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()

Now let's import the different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint

and run the tests:

isinstance(row.features, MLLibVector)

False

isinstance(row.features, MLVector)

True

As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:

LabeledPoint(0.0, row.features)

TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
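The isinstance results above come down to ordinary Python behaviour: two classes with the same name in different modules are unrelated types. A minimal sketch with stand-in classes follows. The module names "fake.ml.linalg" and "fake.mllib.linalg" and the helper make_vector_family are hypothetical and exist only for illustration; they are not the real pyspark classes.

```python
def make_vector_family(module_name):
    """Build a Vector base class and a SparseVector subclass for one stand-in package."""
    base = type("Vector", (object,), {"__module__": module_name})
    sparse = type("SparseVector", (base,), {"__module__": module_name})
    return base, sparse


# Two independent class hierarchies that happen to use identical class names.
MLVector, MLSparseVector = make_vector_family("fake.ml.linalg")
MLLibVector, MLLibSparseVector = make_vector_family("fake.mllib.linalg")

# Stand-in for row.features: an instance from the "ml" hierarchy.
features = MLSparseVector()

# The class names match, but isinstance only follows the real inheritance chain.
same_name = type(features).__name__ == MLLibSparseVector.__name__
is_ml = isinstance(features, MLVector)
is_mllib = isinstance(features, MLLibVector)

print(same_name, is_ml, is_mllib)  # True True False
```

Printing type(obj).__module__ is a quick way to tell which of the two hierarchies an object actually comes from.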

You can convert an ML object to an MLLib one:

from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))

LabeledPoint(0.0, (1,[],[]))

or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))

LabeledPoint(0.0, (1,[],[]))

But in general, you should avoid situations where this conversion is necessary.

Answer 2

If you just want to convert SparseVectors from pyspark.ml to pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column with the SparseVectors is named "features". Then the following lines will do it:

from pyspark.mllib.util import MLUtils
df = MLUtils.convertVectorColumnsFromML(df, "features")

I ran into this problem because, when using CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints: they are incompatible with the SparseVector from pyspark.ml.

I wonder why CountVectorizer in their latest documentation does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...

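Update: I had misunderstood this. The ml library is designed for DataFrame objects, and the mllib library is designed for RDD objects. The DataFrame data structure has been the recommended one since Spark 2.0, because SparkSession is more compatible than SparkContext (it still stores a SparkContext object) and delivers DataFrames instead of RDDs. I found this post, which gave me the "aha" effect: mllib and ml. Thanks, Alberto Bonsanto :).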

To use, for example, NaiveBayes from mllib, I had to transform my DataFrame into the LabeledPoint objects that NaiveBayes from mllib expects.

But it is much easier to use NaiveBayes from ml, because you don't need LabeledPoints at all: you just specify the feature and class columns of your DataFrame.

PS: I struggled with this problem for hours, so I felt I needed to post it here :)
