Document classification in spark mllib

I want to classify documents belonging to sports, entertainment, and politics. I created a bag of words that outputs something like this:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something naive Bayes can take as classification input, such as an RDD? Or is there any trick by which I can convert the HTML files directly into something MLlib's naive Bayes can use?

For text classification, you need:

  • a dictionary (vocabulary)
  • a way to convert each document into a vector using that dictionary
  • a label for each document vector, as sketched below:

    doc_vec1 -> label1

    doc_vec2 -> label2

    ...
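For example, starting from (count, word) pairs like the ones in your question, a minimal sketch with the RDD-based pyspark.mllib API could look like this. Note that the `vocab` dictionary, the numeric label encoding, and the sample documents here are made up for illustration; in practice you would build the vocabulary from your whole corpus:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import NaiveBayes

    sc = SparkContext.getOrCreate()

    # hypothetical dictionary built from the whole corpus: word -> index
    vocab = {"saurashtra": 0, "saumyajit": 1, "satyendra": 2}

    def to_labeled_point(label, counts):
        # counts is a list of (count, word) pairs from the bag of words
        pairs = sorted((vocab[w], float(c)) for c, w in counts if w in vocab)
        indices = [i for i, _ in pairs]
        values = [v for _, v in pairs]
        return LabeledPoint(label, Vectors.sparse(len(vocab), indices, values))

    # hypothetical numeric labels: 0 = sports, 1 = entertainment, 2 = politics
    docs = sc.parallelize([
        (0, [(1, 'saurashtra'), (1, 'saumyajit')]),
        (2, [(1, 'satyendra')]),
    ])
    training = docs.map(lambda doc: to_labeled_point(doc[0], doc[1]))

    model = NaiveBayes.train(training, lambda_=1.0)
    print(model.predict(Vectors.sparse(len(vocab), [2], [1.0])))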

Alternatively, with the DataFrame-based spark.ml API, the sample is very simple:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, StringIndexer
    from pyspark.ml.classification import NaiveBayes

    # regular-expression tokenizer: split the "Descript" text on non-word characters
    regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")
    # custom stop words to remove from the token lists
    add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]
    stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                        stopWords=add_stopwords)
    # bag-of-words counts over the filtered tokens
    countVectors = CountVectorizer(inputCol="filtered", outputCol="features",
                                   vocabSize=10000, minDF=5)
    # encode the string class column "Category" into a numeric "label"
    labelIndexer = StringIndexer(inputCol="Category", outputCol="label")

    # apply all feature stages to `data`, the raw input DataFrame
    # with "Descript" and "Category" columns (not shown here)
    pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, labelIndexer])
    dataset = pipeline.fit(data).transform(data)

    (trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
    nb = NaiveBayes(smoothing=1)
    model = nb.fit(trainingData)
    predictions = model.transform(testData)
    predictions.filter(predictions['prediction'] == 0) \
        .select("Descript", "Category", "probability", "label", "prediction") \
        .orderBy("probability", ascending=False) \
        .show(n=10, truncate=30)
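To see how well the model does on the held-out split, you could also add the standard spark.ml evaluator. This is a small sketch; `predictions` is the DataFrame produced above:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # overall accuracy of the predictions on the test split
    evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")
    print("Test accuracy = %g" % evaluator.evaluate(predictions))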