How to get the word corresponding to highest tf-idf using PySpark?
I have seen similar posts, but none with a complete answer, hence posting here.
I am using TF-IDF in Spark to get the word with the maximum tf-idf value in a document. I use the following code.
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover
from pyspark.ml import Pipeline

tokenizer = Tokenizer(inputCol="doc_cln", outputCol="tokens")
remover1 = StopWordsRemover(inputCol="tokens",
                            outputCol="stopWordsRemovedTokens")
stopwordList = ["word1", "word2", "word3"]
remover2 = StopWordsRemover(inputCol="stopWordsRemovedTokens",
                            outputCol="filtered", stopWords=stopwordList)
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=2000)
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)

pipeline = Pipeline(stages=[tokenizer, remover1, remover2, hashingTF, idf])
model = pipeline.fit(df)
results = model.transform(df)
results.cache()
I got results like

|[a8g4i9g5y, hwcdn] |(2000,[905,1104],[7.34977707433047,7.076179741760428])

where

filtered: array (nullable = true)
features: vector (nullable = true)

How do I extract the array from "features"? Ideally, I would like to get the word corresponding to the highest tf-idf, like this:

|a8g4i9g5y|7.34977707433047

Thanks in advance!
Your features column is of type vector, from the package pyspark.ml.linalg. Based on your data

(2000,[905,1104],[7.34977707433047,7.076179741760428])

it is clearly a SparseVector, which breaks down into 3 main parts:

size: 2000
indices: [905,1104]
values: [7.34977707433047,7.076179741760428]

What you are looking for is the values attribute of that vector.
For other 'literal' PySpark SQL types, such as StringType or IntegerType, you can access their attributes (and aggregate functions) through the SQL functions package (docs). But vector is not a literal SQL type, and the only way to access its attributes is through a UDF, like so:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Important: `vector.values` returns a numpy ndarray.
# PySpark doesn't understand ndarray, so you'd want to
# convert it to a normal Python list using `tolist`
def extract_values_from_vector(vector):
    return vector.values.tolist()

# Wrap it as a regular UDF
extract_values_from_vector_udf = udf(extract_values_from_vector, ArrayType(DoubleType()))

# And use that UDF to get your values
results.select(extract_values_from_vector_udf('features'), 'features')