Document classification in Spark MLlib
I want to classify documents belonging to sports, entertainment, or politics. I have created a bag of words that outputs something like:
(1, 'saurashtra')
(1, 'saumyajit')
(1, 'satyendra')
I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something Naive Bayes can take as classification input, such as an RDD? Or is there any trick to convert HTML files directly into something MLlib's Naive Bayes can use?
For text classification you need:
- a dictionary (vocabulary)
- a way to convert each document into a vector using that dictionary
Then label the document vectors:
doc_vec1 -> label1
doc_vec2 -> label2
...
The sample below is quite simple.
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, StringIndexer
from pyspark.ml.classification import NaiveBayes
# regular expression tokenizer: split the raw text on non-word characters
regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words",
                                pattern=r"\W")
# remove stop words
add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                    stopWords=add_stopwords)
# bag-of-words counts
countVectors = CountVectorizer(inputCol="filtered", outputCol="features",
                               vocabSize=10000, minDF=5)
# index the string class column into the numeric "label" column
labelIndexer = StringIndexer(inputCol="Category", outputCol="label")
# run all feature stages on the input DataFrame (columns "Descript", "Category"),
# then split into training and test sets
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, labelIndexer])
dataset = pipeline.fit(rawData).transform(rawData)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
nb = NaiveBayes(smoothing=1)
model = nb.fit(trainingData)
predictions = model.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("Descript", "Category", "probability", "label", "prediction") \
    .orderBy("probability", ascending=False) \
    .show(n=10, truncate=30)