Classifying text strings into multiple classes using Naive Bayes with NLTK

Question

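I'm currently using Naive Bayes to classify a bunch of texts, and I have multiple categories. Right now I only output the posterior probability and the top category, but what I'd like to do is rank the categories by posterior probability and use the 2nd and 3rd ranked categories as "back up" categories.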

Here is an example:

df = pandas.DataFrame({ 'text' : pandas.Categorical(["I have wings","Metal wings","Feathers","Airport"]), 'true_cat' : pandas.Categorical(["bird","plane","bird","plane"])})

text           true_cat
-----------------------
I have wings   bird
Metal wings    plane
Feathers       bird
Airport        plane

What I'm doing:

new_cat = classifier.classify(features(text))
prob_cat = classifier.prob_classify(features(text))

The final output:

new_cat prob_cat    text           true_cat
bird    0.67        I have wings   bird
bird    0.6         Feathers       bird
bird    0.51        Metal wings    plane
plane   0.8         Airport        plane

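I've found a few examples that use classify_many and prob_classify_many, but since I'm new to Python I haven't been able to translate them to my problem. I also haven't seen them used together with pandas anywhere.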

I would like it to look like this:

df_new = pandas.DataFrame({'text': pandas.Categorical(["I have wings","Metal wings","Feathers","Airport"]),'true_cat': pandas.Categorical(["bird","plane","bird","plane"]), 'new_cat1': pandas.Categorical(["bird","bird","bird","plane"]), 'new_cat2': pandas.Categorical(["plane","plane","plane","bird"]), 'prob_cat1': pandas.Categorical(["0.67","0.51","0.6","0.8"]), 'prob_cat2': pandas.Categorical(["0.33","0.49","0.4","0.2"])})


new_cat1    new_cat2    prob_cat1   prob_cat2   text           true_cat
-----------------------------------------------------------------------
bird        plane       0.67        0.33        I have wings   bird
bird        plane       0.51        0.49        Metal wings    plane
bird        plane       0.6         0.4         Feathers       bird
plane       bird        0.8         0.2         Airport        plane

Any help would be greatly appreciated.

Answer 1

I've got it working now.

# This gives me the probability it's a bird.
prob_cat.prob("bird")

# This gives me the probability it's a plane.
prob_cat.prob("plane")

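Since I have dozens of categories, I'm now looking for a good way to get the probabilities for all of them without typing out every category name, but that should be pretty simple.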

Answer 2

I'm treating your self-answer as part of the question. Presumably you get the probability of the category bird like this:

prob_cat.prob("bird")

Here, prob_cat is an nltk probability distribution (ProbDist). You can get all the categories in a discrete ProbDist, along with their probabilities, like this:

probs = list((x, prob_cat.prob(x)) for x in prob_cat.samples())

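Since you already know the categories you trained on, you can use a predefined list instead of prob_cat.samples(). Finally, you can sort them from most to least probable in the same expression: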

mycategories = ["bird", "plane"]
probs = sorted(((x, prob_cat.prob(x)) for x in mycategories), key=lambda tup: -tup[1])
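
To get from the sorted list to the column layout asked for in the question (new_cat1, prob_cat1, new_cat2, prob_cat2), something like the following might work. This is only a sketch: StubProbDist is a hypothetical stand-in that mimics the samples() and prob() methods so the example runs without a trained classifier, ranked_row is a made-up helper name, and the probabilities are the illustrative ones from the question.

```python
class StubProbDist:
    """Hypothetical stand-in exposing the two ProbDist methods used below."""

    def __init__(self, probs):
        self._probs = dict(probs)

    def samples(self):
        return self._probs.keys()

    def prob(self, sample):
        return self._probs.get(sample, 0.0)


def ranked_row(prob_dist, top_n=2):
    """Return the top_n labels and their probabilities as flat columns."""
    ranked = sorted(
        ((label, prob_dist.prob(label)) for label in prob_dist.samples()),
        key=lambda pair: pair[1],
        reverse=True,  # most probable first
    )
    row = {}
    for rank, (label, p) in enumerate(ranked[:top_n], start=1):
        row[f"new_cat{rank}"] = label
        row[f"prob_cat{rank}"] = p
    return row


# Illustrative probabilities taken from the question's example table.
example = {
    "I have wings": {"bird": 0.67, "plane": 0.33},
    "Airport": {"bird": 0.2, "plane": 0.8},
}

rows = []
for text, probs in example.items():
    row = {"text": text}
    row.update(ranked_row(StubProbDist(probs)))
    rows.append(row)

for row in rows:
    print(row)
```

With a real classifier you would presumably pass classifier.prob_classify(features(text)) to ranked_row instead of the stub, and pandas.DataFrame(rows) should turn the list of dicts into a table with one column per key. Raising top_n to 3 would add the third-ranked "back up" category.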
