Difficulties getting the correct posterior value in a Naive Bayes implementation

For learning purposes, I am trying to implement this lesson using Python, but without using scikit-learn or anything similar.

My attempt is the code below:

import pandas, math

training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'
data_frame = pandas.DataFrame(training_data, columns=['data','label'])
data_frame = data_frame.applymap(lambda s:s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()
labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()
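# build the vocabulary (set of distinct words); its size is the denominator term for Laplace smoothing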
unique_words_set = set()
unique_words = data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)

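# count how often each word occurs within each label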
word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.items():  # .iteritems() was removed in pandas 2.0
        word_frequency_per_labels.append([w,f,l])

word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word','frequency','label'])
laplace_smoothing = 1
results = []
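# score each label: product of the smoothed likelihoods P(w|label) over the words to classify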
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
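        # frequency of w under label l, or 0 if the word never occurs with this label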
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l,math.prod(p)])

print(results)
result = pandas.DataFrame(results, columns=['labels','posterior']).sort_values('posterior',ascending = False).labels.iloc[0]
print(result)

In the blog lesson, their results are roughly 2.76e-05 for Sports and 5.72e-06 for Not sports, so the text is classified as Sports.

But my results are:

[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]

So, what am I doing wrong in my Python implementation? How can I get the same results?

Thanks in advance.

You haven't multiplied by the priors, p(Sports) = 3/5 and p(Not sports) = 2/5. Just scale your answers by these ratios and you will get the correct result. Everything else looks fine.

For example, your math.prod(p) computation implements p(a|Sports) x p(very|Sports) x p(close|Sports) x p(game|Sports), but it leaves out the p(Sports) term. Adding it (and doing the same for the Not sports case) fixes the problem.

In code, this can be achieved with:

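# empirical prior P(label): fraction of training rows carrying label l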
prior = (data_frame.label == l).mean()
results.append([l,prior*math.prod(p)])
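
With this change, the two products above get scaled by 3/5 and 2/5. A quick arithmetic check, reusing the unscaled products printed by the script in the question:

# the two label scores printed by the question's script, before the priors
sports_product = 4.607999999999999e-05
not_sports_product = 1.4293831139825827e-05
print(sports_product * 3 / 5)      # ~2.7648e-05, so Sports wins
print(not_sports_product * 2 / 5)  # ~5.7175e-06

These are exactly the unnormalized values produced by the implementation in the answer below.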

@nick's answer is correct and should be awarded the bounty.

Here is an alternative implementation (from scratch, without pandas) that also supports normalizing the probabilities and handling words that do not appear in the training set:

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set

def tokenize(text: str):
    return [word.lower() for word in text.split()]

def normalize(result: Dict[str, float]):
    total = sum(result.values())
    for k in result.keys():
        result[k] /= total

@dataclass
class Model:
    labels: Set[str] = field(default_factory=set)
    words: Set[str] = field(default_factory=set)
    prob_labels: Dict[str,float] = field(default_factory=lambda: defaultdict(float)) # P(label)
    prob_words: Dict[str,Dict[str,float]] = field(default_factory=lambda: defaultdict(lambda: defaultdict(float)))  # P(word | label) as prob_words[label][word]

    
    def predict(self, text: str, norm=True) -> Dict[str, float]: # P(label | text) as model.predict(text)[label]
        result = {label: self.prob_labels[label] for label in self.labels}
        for word in tokenize(text):
            for label in self.labels:
                if word in self.words:
                    result[label] *= self.prob_words[label][word]
        if norm:
            normalize(result)
        return result

    def train(self, data):
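        # first pass: accumulate label counts and per-label word counts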
        prob_words_denominator = defaultdict(int)
        for row in data:
            text = row[0]
            label = row[1].lower()
            self.labels.add(label)
            self.prob_labels[label] += 1.0
            for word in tokenize(text):
                self.words.add(word)
                self.prob_words[label][word] += 1.0
                prob_words_denominator[label] += 1.0
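        # second pass: turn counts into probabilities, with add-one (Laplace) smoothing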
        for label in self.labels:
            self.prob_labels[label] /= len(data)
            for word in self.words:
                self.prob_words[label][word] = (self.prob_words[label][word] + 1.0) / (prob_words_denominator[label] + len(self.words))
            
            
training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'

model = Model()
model.train(training_data)
print(model.predict(text_to_predict, norm=False))
print(model.predict(text_to_predict))
print(model.predict("none of these words is in training data"))

Output:

{'sports': 2.7647999999999997e-05, 'not sports': 5.7175324559303314e-06}
{'sports': 0.8286395560004286, 'not sports': 0.1713604439995714}
{'sports': 0.6, 'not sports': 0.4}
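
One caveat with multiplying raw probabilities: on longer texts, the product of many small numbers can underflow to 0.0. A common variant sums log-probabilities instead and only exponentiates when normalizing. Here is a minimal sketch of such a helper (predict_log is a hypothetical name, not part of the code above; it reuses the Model and tokenize definitions):

import math

def predict_log(model: Model, text: str) -> Dict[str, float]:
    # same computation as Model.predict, but in log space so that
    # long texts do not underflow to 0.0
    log_scores = {label: math.log(model.prob_labels[label]) for label in model.labels}
    for word in tokenize(text):
        if word in model.words:
            for label in model.labels:
                log_scores[label] += math.log(model.prob_words[label][word])
    # normalize with the log-sum-exp trick for numerical stability
    max_log = max(log_scores.values())
    total = sum(math.exp(v - max_log) for v in log_scores.values())
    return {label: math.exp(v - max_log) / total for label, v in log_scores.items()}

print(predict_log(model, text_to_predict))  # matches the normalized output above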