Naive Bayes classifier from scratch in Python?

I wrote a simple Naive Bayes classifier for a toy dataset:

                 msg  spam
0  free home service     1
1      get free data     1
2  we live in a home     0
3    i drive the car     0

Full code:

import pandas as pd
from collections import Counter

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(word_list):
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
    
    word_count = Counter(words)
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)

ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)

total = len(spam.keys())+len(ham.keys())

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    
    # Likelihood
    spam_likelihood = 0.001 # low value to prevent division error
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
                continue
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
                continue
            ham_likelihood = ham[i]/total * ham_likelihood
    
    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)
    
    spam_posterior = (spam_likelihood*spam_prior)/marginal
    print(msg,round(spam_posterior*100,2))

The problem is that the spamminess classification fails completely on unseen data.

get free home service 0.07
i live in car 97.46

I expected a high value for 'get free home service' and a low value for 'i live in car'.

My question is: is this failure due to a lack of additional data, or is it a mistake in my code?

The problem is in the code: the likelihood computation is incorrect. See Wikipedia: Naive_Bayes_classifier for the correct formula for the likelihood under the bag-of-words model.

Your code behaves as if the likelihood p(word | spam) were 1 whenever the word has not been seen in spam before. With Laplace smoothing it should be 1 / (spam_total + 1), where spam_total is the total number of words in the spam messages (counting repetitions).

If the word has been seen x times in spam before, it should be (x + 1) / (spam_total + 1).
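As a quick sanity check on the toy data above (the spam messages are 'free home service' and 'get free data', so spam_total = 6 and 'free' occurs twice), the smoothed likelihoods work out as follows; this snippet only illustrates the formula, it is not part of the fixed classifier:

spam_total = 6                                   # total words in the two spam messages
p_free_given_spam = (2 + 1) / (spam_total + 1)   # 'free' seen twice        -> 3/7 ≈ 0.43
p_car_given_spam  = (0 + 1) / (spam_total + 1)   # 'car' never seen in spam -> 1/7 ≈ 0.14
print(p_free_given_spam, p_car_given_spam)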

I have changed the Counter to a defaultdict to make it easier to handle previously unseen words, fixed the likelihood calculation, and added Laplace smoothing:

import pandas as pd
from collections import defaultdict

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(sentence_list):
    word_count = defaultdict(lambda:0)
    for sentence in sentence_list:
        for word in sentence:
            word_count[word] += 1
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam_total = sum([len(sentence) for sentence in spam])
spam = word_counter(spam)

ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham_total = sum([len(sentence) for sentence in ham])
ham = word_counter(ham)

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    
    # Likelihood
    spam_likelihood = 1
    ham_likelihood = 1
    for word in data:
        spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
        ham_likelihood *= (ham[word] + 1) / (ham_total + 1)
    
    # marginal likelihood
    marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)
    
    spam_posterior = (spam_likelihood * spam_prior) / marginal
    print(msg,round(spam_posterior*100,2))

Now the results are as expected:

    Spamminess
get free home service 98.04
i live in car 20.65

This can be improved further: for numerical stability, for example, the product of all these probabilities should be replaced by a sum of logarithms.
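A minimal sketch of that change, reusing the spam, ham, spam_total, ham_total, spam_prior, ham_prior and new_data variables defined above (the helper name spam_posterior_log is only illustrative):

import math

def spam_posterior_log(msg):
    # Sum log-probabilities instead of multiplying raw probabilities
    log_spam = math.log(spam_prior)
    log_ham = math.log(ham_prior)
    for word in msg.split():
        log_spam += math.log((spam[word] + 1) / (spam_total + 1))
        log_ham += math.log((ham[word] + 1) / (ham_total + 1))
    # Subtract the max before exponentiating to avoid underflow
    m = max(log_spam, log_ham)
    p_spam = math.exp(log_spam - m)
    p_ham = math.exp(log_ham - m)
    return p_spam / (p_spam + p_ham)

for msg in new_data:
    print(msg, round(spam_posterior_log(msg) * 100, 2))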