How to "normalize" vector values when using Spark CountVectorizer?
CountVectorizer and CountVectorizerModel often produce a sparse feature vector that looks like this:
(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])
This basically says the total vocabulary size is 10, and the current document has 5 unique terms, which sit at positions 0, 1, 4, 6 and 8 of the feature vector. One of those terms appears twice, hence the value 2.0.
Now I would like to "normalize" the feature vector above so it looks like this:
(10,[0,1,4,6,8],[0.3333,0.1667,0.1667,0.1667,0.1667])
i.e., each value is divided by 6, the total count of all the terms together (for example, 0.3333 = 2.0/6).
Is there an efficient way to do this?
Thanks!
You can use Normalizer:
class pyspark.ml.feature.Normalizer(*args, **kwargs)
Normalize a vector to have unit norm using the given p-norm.
With p=1 the divisor is the L1 norm, i.e. the sum of the absolute values of the entries, which for a count vector is exactly the total term count you want to divide by (here, 6):
from pyspark.ml.linalg import SparseVector
from pyspark.ml.feature import Normalizer

# One-row DataFrame holding the example sparse count vector
df = spark.createDataFrame([
    (SparseVector(10, [0, 1, 4, 6, 8], [2.0, 1.0, 1.0, 1.0, 1.0]), )
], ["features"])

# p=1 divides each value by the sum of all values (2+1+1+1+1 = 6)
Normalizer(inputCol="features", outputCol="features_norm", p=1).transform(df).show(1, False)
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |features |features_norm |
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])|(10,[0,1,4,6,8],[0.3333333333333333,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666])|
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
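If you would rather do the division yourself, the same L1 normalization can be written with a UDF. This is just a minimal sketch for comparison, not part of the Spark API: l1_normalize is a hypothetical helper that divides each stored value by the sum of the vector's values.

from pyspark.sql.functions import udf
from pyspark.ml.linalg import SparseVector, VectorUDT

# Hypothetical helper: divide each stored value by the L1 norm (the total count)
def l1_normalize(v):
    total = float(sum(abs(x) for x in v.values))
    return SparseVector(v.size, v.indices, [x / total for x in v.values])

l1_normalize_udf = udf(l1_normalize, VectorUDT())

df.withColumn("features_norm", l1_normalize_udf("features")).show(1, False)

This produces the same values as Normalizer with p=1; in practice the built-in transformer is preferable, since it runs on the JVM and avoids Python serialization overhead.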