How to use tf-idf with Naive Bayes?

Question

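From my searches on the query I am posting here, I found many links that propose a solution but do not say exactly how it is to be done. For example, I looked at the following links: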

Link 1

Link 2

Link 3

Link 4

etc.

So here I present my understanding of how to use the Naive Bayes formula with tf-idf:

Naive Bayes formula:

P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))

In the formula above, tf-idf weighting can be applied as follows:

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class) 

total_unique_words_in_all_classes : as is.
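The scheme described above can be sketched in plain Python 3 without any machine-learning library. This is only an illustration under stated assumptions: the tf-idf variant (raw term frequency times a smoothed idf) is one of several in common use, and the names `tfidf_documents`, `train`, and `predict` are hypothetical, not taken from the question.

```python
import math
from collections import Counter, defaultdict


def tfidf_documents(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            # Assumed variant: raw term frequency times a smoothed idf.
            word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, tf in counts.items()
        })
    return weighted


def train(docs, labels):
    """Estimate log priors and smoothed log P(word|class) from tf-idf sums."""
    weighted = tfidf_documents(docs)
    vocab = sorted({word for doc in docs for word in doc})
    word_weight = defaultdict(lambda: defaultdict(float))  # word_count_in_class
    total_weight = defaultdict(float)                      # total_words_in_class
    for doc_weights, label in zip(weighted, labels):
        for word, weight in doc_weights.items():
            word_weight[label][word] += weight
            total_weight[label] += weight
    class_counts = Counter(labels)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {
        c: {
            # (tf-idf sum + 1) / (class tf-idf total + vocabulary size)
            w: math.log((word_weight[c][w] + 1) / (total_weight[c] + len(vocab)))
            for w in vocab
        }
        for c in class_counts
    }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)
```

Working in log space avoids underflow when many small probabilities are multiplied. Words that never appeared in training are simply ignored at prediction time, which is one possible choice among several.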

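This question has been posted on Stack Overflow several times, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that comes with built-in functions for Naive Bayes and tf-idf. What I really want is to improve the accuracy (currently 30%) of the classifier trained with Naive Bayes. So if there are better ways to achieve good accuracy, suggestions are welcome.

Please advise. I am new to this field.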

Answer 1

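It would be better if you gave us the exact features and classes you want to use, or at least an example. Since none of these has been given concretely, I'll assume the following is your problem:

You have a number of documents, each of which contains a number of words.
You want to classify the documents into classes.
Your feature vector consists of all possible words across all documents, with the count of each word in each document as its value.

Your solution

The tf-idf weighting you gave is as follows: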

word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.

total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)

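Your approach sounds reasonable. The probabilities still sum to 1 whatever tf-idf function is used, and the features reflect the tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.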

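Another possible solution

It took me a while to work through this problem. The main reason was having to worry about keeping the probabilities normalized. Using Gaussian Naive Bayes lets you ignore that issue entirely.

If you want to use this method: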

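Compute the mean and variance of the tf-idf values of each feature for each class.
Compute the likelihood of each feature using a Gaussian distribution with the above mean and variance.
Proceed as usual (multiply by the class prior) and predict.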

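Hard-coding this should not be too difficult, since the Gaussian density is a one-line expression in numpy. I simply prefer this kind of generic solution for problems like these.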
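The Gaussian variant can be sketched as follows. To stay close to the question's constraint of not relying on libraries, this sketch uses only the Python 3 standard library rather than numpy. The function names and the `var_floor` parameter (a small constant added to every variance so that a feature that is constant within a class does not cause a division by zero) are assumptions made for illustration. Each row is assumed to be a dense list of tf-idf values, one per vocabulary word.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, var_floor=1e-9):
    """Per class: log prior plus per-feature mean and variance of tf-idf values."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [
            sum((x - m) ** 2 for x in col) / n + var_floor
            for col, m in zip(zip(*members), means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {
        label: log_prior + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
        for label, (log_prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)
```

Because each feature gets its own density, no normalization over the vocabulary is needed, which is the convenience the answer refers to.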

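Other ways to improve accuracy

Besides the above, you can also use the following techniques to improve accuracy:

Preprocessing:
1. Feature reduction (usually NMF, PCA, or LDA)
2. Additional features

Algorithm:
Naive Bayes is fast, but it is often less accurate than other algorithms. It may be better to do feature reduction and then switch to a discriminative model such as an SVM or logistic regression.

Other:
Bootstrapping, boosting, etc. Be careful not to overfit.

Hope this helps. Leave a comment if anything is unclear.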

Answer 2

P(word|class)=(word_count_in_class+1)/(total_words_in_class+total_unique_words_in_all_classes (basically vocabulary of words in the entire training set))

How does this sum to 1? Using the conditional probability above, I take the sum to be

P(word1|class)+P(word2|class)+...+P(wordn|class) = (total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)

To correct this, I think P(word|class) should be

(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))

Please correct me if I am wrong.
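One way to check this numerically is the small Python 3 sketch below. It assumes the sum runs over every word in the training vocabulary, including words that never occur in the class (each of those contributes 1 to the numerator). The function name and toy counts are hypothetical.

```python
def smoothed_probs(class_counts, vocabulary):
    """Laplace-smoothed P(word|class) for every word in the full vocabulary."""
    total = sum(class_counts.values())
    return {
        word: (class_counts.get(word, 0) + 1) / (total + len(vocabulary))
        for word in vocabulary
    }


vocabulary = ["a", "b", "c", "d"]      # words seen anywhere in training
class_counts = {"a": 3, "b": 1}        # counts inside one class only

probs = smoothed_probs(class_counts, vocabulary)
sum_over_vocabulary = sum(probs.values())
sum_over_seen_words = sum(probs[w] for w in class_counts)
```

With these toy numbers the sum over the whole vocabulary comes to 1, while the sum restricted to words seen in the class comes to (4 + 2) / (4 + 4) = 0.75, which is the expression written out in this answer. Under that reading, the original denominator is already normalized as long as unseen words are included in the sum.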

Answer 3

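I think there are two ways to do it:

Round the tf-idf values to integers, then use the multinomial distribution for the conditional probabilities. See this paper: https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf
Use the Dirichlet distribution, a continuous counterpart of the multinomial distribution, for the conditional probabilities.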
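The first option could look roughly like the Python 3 sketch below. The `scale` factor is a hypothetical parameter introduced here so that small weights do not all round to zero. This is not claimed to be the exact procedure of the linked paper.

```python
def to_pseudo_counts(weights, scale=10):
    """Scale tf-idf weights and round them to integer pseudo-counts."""
    return {word: int(round(weight * scale)) for word, weight in weights.items()}
```

The resulting integers can then be used wherever an ordinary multinomial Naive Bayes implementation expects word counts.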

I am not sure whether a Gaussian mixture would be better.

tf-idf

python-2.7

naivebayes