Generate keywords using Apache Spark and MLlib
I have written the following code:
val hashingTF = new HashingTF()
val tfv: RDD[Vector] = sparkContext.parallelize(articlesList.map { t => hashingTF.transform(t.words) })
tfv.cache()
val idf = new IDF().fit(tfv)
val rate: RDD[Vector] = idf.transform(tfv)
How can I get the top 5 keywords for each articlesList item from the "rate" RDD?
Addition:
articlesList contains objects of this class:
case class ArticleInfo(url: String, author: String, date: String, var keyWords: List[String], words: List[String])
words contains all the words in the article.
I don't understand the structure of rate; the documentation only says:
@return an RDD of TF-IDF vectors
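To make that structure concrete: each vector in rate corresponds to one article, its indices are hash buckets (HashingTF defaults to 2^20 of them), and the value at a word's bucket is that word's term frequency multiplied by the inverse document frequency of the bucket. Below is a minimal pure-Scala sketch of that layout, not MLlib's actual code. The object TfIdfStructureSketch, its helper names, the 16-bucket feature space, and the hashCode-based indexOf are all assumptions made for illustration; the IDF formula log((m + 1) / (df + 1)) is the one MLlib documents.

```scala
object TfIdfStructureSketch {
  // Hypothetical tiny feature space; MLlib's HashingTF defaults to 2^20 buckets.
  val numFeatures = 16

  // Toy stand-in for HashingTF.indexOf: non-negative hash modulo numFeatures.
  def indexOf(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Term frequencies of one document, keyed by bucket index (a sparse vector as a Map).
  def termFrequencies(words: Seq[String]): Map[Int, Double] =
    words.groupBy(indexOf).map { case (idx, ws) => idx -> ws.size.toDouble }

  // IDF per bucket: log((m + 1) / (df + 1)), where m is the number of documents
  // and df is the number of documents with a non-zero count in that bucket.
  def inverseDocFrequencies(tfs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = tfs.size
    val df = tfs.flatMap(_.keys).groupBy(identity).map { case (idx, xs) => idx -> xs.size }
    df.map { case (idx, d) => idx -> math.log((m + 1.0) / (d + 1.0)) }
  }

  // TF-IDF vector of one document: tf * idf, bucket by bucket.
  def tfIdf(tf: Map[Int, Double], idf: Map[Int, Double]): Map[Int, Double] =
    tf.map { case (idx, f) => idx -> f * idf(idx) }

  // Weight of a word in a TF-IDF vector: look up the bucket the word hashes to.
  def weightOf(word: String, vec: Map[Int, Double]): Double =
    vec.getOrElse(indexOf(word), 0.0)
}
```

With the two toy documents Seq("spark", "spark", "rdd") and Seq("rdd", "vector"), the weight of "spark" in the first vector is 2 * log(3/2), while "rdd" gets 0 because it occurs in every document. The same lookup, hashingTF.indexOf(word) followed by reading the vector at that index, is how a word's score can be read out of a vector in rate.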
My solution is:
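import scala.collection.mutable

val KEYWORD_COUNT = 5 // the question asks for the top 5 keywords per article

// rate.collect() returns the TF-IDF vectors in the same order as articlesList
(articlesList, rate.collect()).zipped.foreach { (art, tfidf) =>
  val keywords = new mutable.TreeSet[(String, Double)]
  // distinct: a repeated word must not evict another keyword
  art.words.distinct.foreach { word =>
    val wordHash = hashingTF.indexOf(word) // index of the word in the TF-IDF vector
    val wordTFIDF = tfidf(wordHash)
    if (keywords.size == KEYWORD_COUNT) {
      val minimum = keywords.minBy(_._2)
      // compare against the TF-IDF score, not the hash index
      if (minimum._2 < wordTFIDF) {
        keywords.remove(minimum)
        keywords.add((word, wordTFIDF))
      }
    } else {
      keywords.add((word, wordTFIDF))
    }
  }
  art.keyWords = keywords.toList.map(_._1)
}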