Working with data inside a TF-IDF sparse vector, or saving it to a DataFrame or an external file

Question

I am computing TF and IDF with PySpark's HashingTF and IDF, using the following code:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext()

# Load documents (one per line).
documents = sc.textFile("random.txt").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()

idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)

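The question is: I can print tfidf to the screen with the collect() method, but how do I access specific data inside it, or save the whole tfidf vector space to an external file or a DataFrame?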

Answer 1

Transforming with HashingTF and IDF returns an RDD in which each element is a pyspark.mllib.linalg.Vector (org.apache.spark.mllib.linalg.Vector in Scala)*. This means:

You can access individual elements with simple indexing. For example:

documents = sc.textFile("README.md").map(lambda line: line.split(" "))
tf = HashingTF().transform(documents)
idf = IDF().fit(tf)
tfidf = idf.transform(tf)

v = tfidf.first()
v
## SparseVector(1048576, {261052: 0.0, 362890: 0.0, 816618: 1.9253})

type(v)
## pyspark.mllib.linalg.SparseVector

v[0]
## 0.0
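As a rough illustration of why v[0] is 0.0: a sparse vector stores only a sorted list of indices and the matching values, and every position that is not stored reads as zero. The sketch below is a plain-Python model of that lookup, not Spark's implementation; the numbers are copied from the vector printed above, and the `lookup` helper is hypothetical. On a real pyspark SparseVector the same data is exposed through the `size`, `indices` and `values` attributes.

```python
from bisect import bisect_left

# Values copied from the SparseVector printed above.
size = 1048576
indices = [261052, 362890, 816618]   # sorted positions that are stored
values = [0.0, 0.0, 1.9253]          # weight stored at each position


def lookup(i):
    """Return the weight at position i; positions that are not stored are 0.0."""
    if not 0 <= i < size:
        raise IndexError(i)
    pos = bisect_left(indices, i)    # binary search in the sorted index list
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    return 0.0


# Keep only the entries that actually carry weight.
nonzero = {i: w for i, w in zip(indices, values) if w != 0.0}
```

Here `nonzero` keeps just the entry at index 816618, which is usually the part of a TF-IDF vector worth inspecting.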

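The RDD can be saved directly as a text file. Vectors have a meaningful string representation, and Vectors.parse can be used to restore the original structure.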

from pyspark.mllib.linalg import Vectors

tfidf.saveAsTextFile("/tmp/tfidf")
sc.textFile("/tmp/tfidf/").map(Vectors.parse)
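If the saved files need to be read outside Spark, each line can probably be parsed with the standard library alone. The sketch below assumes every line has the `(size,[indices],[values])` form that a SparseVector prints as, or the `[v0,v1,...]` form of a dense vector; `parse_vector_line` is a hypothetical helper, and the sample line is the vector from this answer. Lines containing values such as `nan` or `inf` would need extra handling.

```python
import ast


def parse_vector_line(line):
    """Parse one saved line into (size, {index: value})."""
    parsed = ast.literal_eval(line.strip())
    if isinstance(parsed, tuple):
        # Sparse form: (size,[indices],[values])
        size, indices, values = parsed
        return size, dict(zip(indices, values))
    # Dense form: [v0,v1,...]; keep only the non-zero positions.
    return len(parsed), {i: v for i, v in enumerate(parsed) if v != 0.0}


line = "(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])"
size, entries = parse_vector_line(line)
```

Using `ast.literal_eval` rather than `eval` means only Python literals are accepted, so a malformed line raises an error instead of executing anything.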

The vectors can be put in a DataFrame:

df = tfidf.map(lambda v: (v, )).toDF(["features"])

df.printSchema()
## root
## |-- features: vector (nullable = true)

df.show(1, False)
## +-------------------------------------------------------------+
## |features                                                     |
## +-------------------------------------------------------------+
## |(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])|
## +-------------------------------------------------------------+
## only showing top 1 row

* HashingTF is irreversible, so it cannot be used to extract information about specific tokens.
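To make the footnote concrete, here is a minimal plain-Python sketch of the hashing trick. It assumes `zlib.crc32` as a stand-in hash (Spark uses a different hash function, so the indices will not match Spark's), and the `bucket` and `build_reverse_index` helpers are hypothetical. Because many tokens can fall into one bucket, an index alone cannot be turned back into a token; one possible workaround is to record the token-to-bucket assignments yourself while hashing.

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 1 << 20   # 1048576, the vector size seen in this answer


def bucket(token, num_features=NUM_FEATURES):
    """Map a token to a feature index. crc32 is a stand-in, not Spark's hash."""
    return zlib.crc32(token.encode("utf-8")) % num_features


def build_reverse_index(documents):
    """Record which tokens landed in each bucket while hashing."""
    reverse = defaultdict(set)
    for doc in documents:
        for token in doc:
            reverse[bucket(token)].add(token)
    return reverse


docs = [["spark", "tf", "idf"], ["spark", "hashing"]]
reverse = build_reverse_index(docs)

# All tokens that share the bucket of "spark"; collisions would add more names.
candidates = reverse[bucket("spark")]
```

The reverse index only narrows an index down to a set of candidate tokens, which is the most any hashing-based scheme can offer.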
