Cheapest way to classify HTTP POST objects

Question

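I can use SciPy to classify text on my own machine, but I need to classify string objects from HTTP POST requests in real time, or close to it. Which algorithms should I look into if my goals are high concurrency, near-real-time output, and a small memory footprint? I figure I could get by with a support vector machine (SVM) implementation in Go, but is that the best algorithm for my use case?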

Answer 1

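Yes, an SVM (with a linear kernel) should be a good starting point. You can use scikit-learn (it wraps liblinear, I believe) to train your model. Once the model is learned, it is just a list of feature:weight pairs for each class you want to classify into. Something like this (assuming you have only 3 classes):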

class1[feature1] = weight11
class1[feature2] = weight12
...
class1[featurek] = weight1k    ------- for class 1

... different <feature, weight> ------ for class 2
... different <feature, weight> ------ for class 3 , etc
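As a rough illustration, the per-class feature:weight lists above could be held in Go as a nested map. This is only a sketch: the JSON layout, the `Model` type, and the `loadModel` helper are my own assumptions about how the trained weights might be exported, not something scikit-learn produces for you.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Model maps class name -> feature name -> learned weight.
// This layout is an assumption for illustration, not a scikit-learn export format.
type Model map[string]map[string]float64

// loadModel decodes JSON shaped like {"class": {"feature": weight, ...}, ...}.
func loadModel(raw []byte) (Model, error) {
	var m Model
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	// Made-up weights; real ones would be dumped from the trained model.
	raw := []byte(`{
		"class1": {"feature1": 0.5, "feature2": -1.25},
		"class2": {"feature1": -0.75, "feature3": 2.0}
	}`)
	m, err := loadModel(raw)
	if err != nil {
		panic(err)
	}
	// Looking up a feature a class has no weight for simply yields 0.
	fmt.Println(len(m), m["class1"]["feature2"], m["class1"]["feature3"])
}
```

A plain map like this is read-only after loading, so many goroutines can look weights up at the same time without locking.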

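At prediction time you don't need scikit-learn at all; you can do the linear computation in whatever language your server backend uses. Suppose a particular POST request contains the features (feature3, feature5). What you need to do is this: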

linear_score[class1] = 0
linear_score[class1] += lookup weight of feature3 in class1
linear_score[class1] += lookup weight of feature5 in class1

linear_score[class2] = 0
linear_score[class2] += lookup weight of feature3 in class2
linear_score[class2] += lookup weight of feature5 in class2

..... same thing for class3
pick class1, or class2 or class3 whichever has the highest linear_score
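The scoring steps above can be sketched in Go roughly as follows. The `Model` type, the `predict` name, and the weights are made up for illustration. Like the pseudocode, the sketch leaves out any per-class intercept; if your trained model has one, you could store it as a feature that is always present.

```go
package main

import (
	"fmt"
	"math"
)

// Model maps class name -> feature name -> learned weight (assumed layout).
type Model map[string]map[string]float64

// predict sums, for each class, the weights of the features present in the
// request and returns the class with the highest linear score.
func predict(m Model, features []string) (string, float64) {
	best, bestScore := "", math.Inf(-1)
	for class, weights := range m {
		score := 0.0
		for _, f := range features {
			score += weights[f] // a feature the class has no weight for adds 0
		}
		// Ties go to the lexicographically smaller class name so the result
		// does not depend on Go's randomized map iteration order.
		if score > bestScore || (score == bestScore && class < best) {
			best, bestScore = class, score
		}
	}
	return best, bestScore
}

func main() {
	// Hypothetical weights for three classes.
	m := Model{
		"class1": {"feature3": 0.5, "feature5": 0.25},
		"class2": {"feature3": -0.5, "feature5": 1.0},
		"class3": {"feature3": 0.125},
	}
	class, score := predict(m, []string{"feature3", "feature5"})
	fmt.Println(class, score) // prints: class1 0.75
}
```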

Going a step further: if you can define feature weights in some way (for example, using the tf-idf score of each token), your prediction could become:

linear_score[class1] += class1[feature3] * feature_weight[feature3]
... and so on for every active feature and every class.
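In Go, the weighted variant might look something like this. It is a sketch under assumptions: the per-request weights (tf-idf or whatever you choose) are taken as already computed and passed in as a map, and `predictWeighted` is just a name I picked.

```go
package main

import (
	"fmt"
	"math"
)

// Model maps class name -> feature name -> learned weight (assumed layout).
type Model map[string]map[string]float64

// predictWeighted scores each class as the sum over the request's features of
// (learned class weight) * (per-request feature weight, e.g. a tf-idf score).
func predictWeighted(m Model, request map[string]float64) (string, float64) {
	best, bestScore := "", math.Inf(-1)
	for class, weights := range m {
		score := 0.0
		for feature, featureWeight := range request {
			score += weights[feature] * featureWeight
		}
		// Deterministic tie-break on class name.
		if score > bestScore || (score == bestScore && class < best) {
			best, bestScore = class, score
		}
	}
	return best, bestScore
}

func main() {
	// Hypothetical learned weights.
	m := Model{
		"class1": {"feature3": 0.5, "feature5": 0.25},
		"class2": {"feature3": -0.5, "feature5": 1.0},
	}
	// Hypothetical per-request weights; these change with every request.
	request := map[string]float64{"feature3": 0.5, "feature5": 2.0}
	class, score := predictWeighted(m, request)
	fmt.Println(class, score)
}
```

With every per-request weight set to 1 this reduces to the unweighted sum, so the two versions can share one code path.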

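Note that feature_weight[feature k] usually differs from request to request. Since the number of active features in any single request has to be far smaller than the total number of features under consideration (think 50 tokens or features versus a whole vocabulary of 1 million tokens), prediction should be very fast. I can imagine that, once your model is ready, the prediction could be implemented on top of a key-value store (e.g. Redis).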
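To make the cost argument concrete, here is a minimal sketch of a Go handler that only ever touches the tokens present in the request body. Several things are assumptions on my part: tokens are lowercased and split on whitespace, raw term counts stand in for a real tf-idf score, the weights live in an in-memory map rather than Redis, and the `/classify` path, `classifyText`, and `handler` names are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"math"
	"net/http"
	"strings"
)

// Model maps class name -> feature name -> learned weight (assumed layout).
type Model map[string]map[string]float64

// classifyText tokenizes the text and scores each class using only the tokens
// that occur in it, so the work is proportional to the request size times the
// number of classes, not to the vocabulary size.
func classifyText(m Model, text string) string {
	counts := map[string]float64{}
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		counts[tok]++ // term count used as a stand-in for tf-idf
	}
	best, bestScore := "", math.Inf(-1)
	for class, weights := range m {
		score := 0.0
		for tok, n := range counts {
			score += weights[tok] * n
		}
		if score > bestScore || (score == bestScore && class < best) {
			best, bestScore = class, score
		}
	}
	return best
}

// handler classifies the body of a POST request. The model is only read,
// never written, so concurrent requests can share it without locking.
func handler(m Model) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20)) // cap at 1 MiB
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		fmt.Fprintln(w, classifyText(m, string(body)))
	}
}

func main() {
	// Hypothetical weights for two classes.
	m := Model{
		"spam": {"free": 1.5, "win": 1.0},
		"ham":  {"meeting": 2.0, "free": -0.5},
	}
	fmt.Println(classifyText(m, "Win a FREE prize"))
	// To serve requests, register the handler and start a server, e.g.:
	//   http.HandleFunc("/classify", handler(m))
	//   log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Swapping the in-memory map for a key-value store would change only the weight lookups; the scoring loop stays the same.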

Tags: algorithm, machine-learning, svm, go, text-classification