Text classification - how to approach

Question

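I'll try to describe what I have in mind.

Text content is stored in an MS SQL database. Content arrives daily as a stream. Some people go through the content every day and, if it meets certain criteria, mark it as validated. There is only one category: content is either "valid" or it is not.

What I want is to create a model based on the already validated content, save it, and use that model to "pre-validate" or mark new incoming content. I would also like to update the model once in a while based on newly validated content. Hopefully I have explained myself clearly.

I am thinking of using Spark Streaming to classify the data based on the created model, with the Naive Bayes algorithm. But how would you approach creating, updating, and storing the model? There are ~200K+ validated results (texts) of varying length. Do I need that many for the model? And how would I use this model in Spark Streaming?

Thanks in advance.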

Answer 1

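Wow, this question is very broad and has more to do with Machine Learning than with Apache Spark, but I will try to give you some hints or steps to follow (I won't do the work for you).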

Import all the libraries you need:

```
from pyspark.mllib.classification import LogisticRegressionWithSGD, LogisticRegressionModel
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
import re
```

Load your data into an RDD:

```
# (text, label) pairs: 1.0 means "valid", 0.0 means "not valid"
msgs = [("I love Star Wars but I can't watch it today", 1.0),
        ("I don't love Star Wars and people want to watch it today", 0.0),
        ("I dislike not being able to watch Star Wars", 1.0),
        ("People who love Star Wars are my friends", 1.0),
        ("I prefer to watch Star Wars on Netflix", 0.0),
        ("George Lucas shouldn't have sold the franchise", 1.0),
        ("Disney makes better movies than everyone else", 0.0)]

# sc is the active SparkContext
rdd = sc.parallelize(msgs)
```

Tokenize your data (this may be easier if you use ML):

```
# Split on runs of spaces and lower-case every word, keeping the label
rdd = rdd.map(lambda pair: ([w.lower() for w in re.split(" +", pair[0])], pair[1]))
```

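Remove all unnecessary words (widely known as stop words) and symbols such as `,.&`:

```
commons = ["and", "but", "to"]
# Build a list (not a lazy filter object) so the tokens can be counted later
rdd = rdd.map(lambda pair: ([t for t in pair[0] if t not in commons], pair[1]))
```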
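
The split on spaces leaves punctuation attached to words (for example `today,`), and the stop-word list here is only a three-word placeholder. A minimal sketch of a tokenizer that handles both in plain Python follows. The `STOP_WORDS` set, the `TOKEN_PATTERN` regex, and the `tokenize` name are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "but", "to"}

# Keep runs of letters, digits and apostrophes; everything else is a separator.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lower-case the text, drop punctuation, and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(tokenize("I love Star Wars, but I can't watch it today & tomorrow."))
```

If it suits your data, something like `rdd.map(lambda pair: (tokenize(pair[0]), pair[1]))` could stand in for the tokenizing and stop-word steps.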

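Create a dictionary of all the distinct words in your whole dataset. That sounds huge, but it is usually smaller than you would expect, and I bet it will fit on your master node (there are other ways to approach this, but I will keep it this way for simplicity).
```
# Collect the distinct words (the vocabulary) on the driver
words = rdd.flatMap(lambda pair: pair[0]).distinct().collect()
diffwords = len(words)
```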
Convert your features to a DenseVector or a SparseVector. I would recommend the second, because a SparseVector normally takes less space to represent, although it depends on the data. Note that there are better alternatives, such as hashing, but I am trying to stay loyal to my verbose approach. After that, transform each tuple into a LabeledPoint.
```
def sparsify(length, tokens):
    tokens = list(tokens)  # make sure .count() works on any iterable
    # Vocabulary index of each distinct token, and how often that token occurs
    indices = [words.index(t) for t in set(tokens)]
    quantities = [float(tokens.count(words[i])) for i in indices]

    return SparseVector(length, list(zip(indices, quantities)))

# Replace each (tokens, label) tuple with a LabeledPoint
rdd = rdd.map(lambda pair: LabeledPoint(pair[1], sparsify(diffwords, pair[0])))
```
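
The hashing alternative avoids collecting the vocabulary on the driver: each token is mapped to an index by a hash function, so no dictionary has to be built or shipped. Below is a minimal pure-Python sketch of that idea. `NUM_FEATURES`, `hashed_index`, and `hash_features` are hypothetical names, and CRC32 is used only because it gives the same value in every process. Spark's MLlib provides `pyspark.mllib.feature.HashingTF` for the same purpose.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1 << 10  # assumed vector size; larger values mean fewer collisions


def hashed_index(token, num_features=NUM_FEATURES):
    """Map a token to a stable index in [0, num_features)."""
    return zlib.crc32(token.encode("utf-8")) % num_features


def hash_features(tokens, num_features=NUM_FEATURES):
    """Return {index: count}; colliding tokens share an index and their counts add up."""
    counts = Counter(hashed_index(t, num_features) for t in tokens)
    return {index: float(count) for index, count in counts.items()}


features = hash_features(["star", "wars", "star"])
print(sorted(features.items()))
```

The resulting dictionary could be handed to `SparseVector(NUM_FEATURES, features)`. The trade-off is that an index can no longer be mapped back to a single word.
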
Fit your favorite model; in this case I used LogisticRegressionWithSGD for ulterior motives.
```
lrm = LogisticRegressionWithSGD.train(rdd)
```
Save your model.
```
lrm.save(sc, "mylovelymodel.model")
```
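
The model is trained on vectors of word indices and never sees the vocabulary itself, so an application that loads it also needs the same word-to-index mapping, or its feature vectors will not line up with the weights. A minimal sketch of persisting the vocabulary as JSON next to the model follows. The helper names and the file name are hypothetical, and a three-word list stands in for the real `words` list.

```python
import json
import os
import tempfile


def save_vocabulary(words, path):
    """Write the ordered word list so indices can be rebuilt later."""
    with open(path, "w") as handle:
        json.dump(words, handle)


def load_vocabulary(path):
    """Read the word list back and build a word -> index lookup."""
    with open(path) as handle:
        words = json.load(handle)
    return {word: index for index, word in enumerate(words)}


# Example with a stand-in vocabulary and a temporary file.
vocab_path = os.path.join(tempfile.mkdtemp(), "vocabulary.json")
save_vocabulary(["star", "wars", "love"], vocab_path)
word_index = load_vocabulary(vocab_path)
print(word_index["wars"])
```

Loading into a dictionary also makes lookups constant-time, whereas `words.index` scans the whole list for every token.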

Load your LogisticRegressionModel in another application.

```
lrm = LogisticRegressionModel.load(sc, "mylovelymodel.model")
```

Predict the categories.

```
# A hand-built vector of length 37 with a count of 1.0 at each listed index
lrm.predict(SparseVector(37,[2,4,5,13,15,19,23,26,27,29],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))
# outputs 0
```
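
The hard-coded vector stands in for a real message. To score new text, it has to go through the same tokenizing and indexing as the training data, and words that never appeared in training need to be skipped (`words.index` would raise `ValueError` for them). Here is a sketch under those assumptions; `to_sparse_pairs` is a hypothetical helper and the vocabulary is a small stand-in.

```python
import re
from collections import Counter


def to_sparse_pairs(text, word_index):
    """Turn raw text into sorted (index, count) pairs, ignoring unknown words."""
    tokens = [w.lower() for w in re.split(" +", text.strip()) if w]
    counts = Counter(word_index[t] for t in tokens if t in word_index)
    return sorted((index, float(count)) for index, count in counts.items())


word_index = {"star": 0, "wars": 1, "love": 2}  # stand-in vocabulary
pairs = to_sparse_pairs("I love Star Wars and Star Trek", word_index)
print(pairs)
```

The pairs can then be passed to `SparseVector(len(word_index), pairs)` and on to `lrm.predict`.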

Note that I have not evaluated the model's accuracy, but it looks rather nice, doesn't it?
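
Since accuracy was not measured, here is a sketch of how it might be checked on held-out data. It is plain Python over (prediction, label) pairs; in Spark, such pairs would typically come from running `lrm.predict` on the features of a test set that was kept out of training. The function name and the sample numbers are made up for illustration.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for predicted, label in pairs if predicted == label)
    return correct / float(len(pairs))


# Made-up predictions and labels, for illustration only.
sample = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
print(accuracy(sample))
```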

Tags: machine-learning, apache-spark, apache-spark-ml, apache-spark-mllib