Naive Bayes classifier from scratch in Python?
I wrote a simple naive Bayes classifier for a toy dataset:
msg spam
0 free home service 1
1 get free data 1
2 we live in a home 0
3 i drive the car 0
Full code:
import pandas as pd
from collections import Counter
data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)
def word_counter(word_list):
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
    word_count = Counter(words)
    return word_count
spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)
total = len(spam.keys())+len(ham.keys())
# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)
new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 0.001 # low value to prevent divisional error
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
                continue
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
                continue
            ham_likelihood = ham[i]/total * ham_likelihood
    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)
    spam_posterior = (spam_likelihood*spam_prior)/marginal
    print(msg,round(spam_posterior*100,2))
The problem is that my Spamminess classification fails completely on unseen data:
get free home service 0.07
i live in car 97.46
I expected a high value for get free home service and a low value for i live in car.
My question: is this failure due to the lack of additional data, or due to a mistake in my code?
The problem is in the code: the likelihood computation is incorrect.
See Wikipedia:Naive_Bayes_classifier for the correct likelihood formula under a bag-of-words model.
When a word has not been encountered in spam before, your code behaves as if the likelihood p(word | spam) were 1. With Laplace smoothing it should instead be 1 / (spam_total + 1), where spam_total is the total number of words in spam (with repetitions).
When the word has been encountered x times in spam before, it should be (x + 1) / (spam_total + 1).
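As a quick sanity check of that formula on the toy dataset (a minimal sketch; the spam counts are written out by hand from the two spam messages, "free home service" and "get free data", giving 6 words in total):

```python
# Laplace-smoothed likelihood p(word | spam) for the toy dataset
spam_counts = {'free': 2, 'home': 1, 'service': 1, 'get': 1, 'data': 1}
spam_total = 6  # total word count in spam, with repetitions

def p_word_given_spam(word):
    # (x + 1) / (spam_total + 1), where x is the word's count in spam
    return (spam_counts.get(word, 0) + 1) / (spam_total + 1)

print(p_word_given_spam('free'))  # seen twice: (2 + 1) / (6 + 1) = 3/7
print(p_word_given_spam('car'))   # unseen:     (0 + 1) / (6 + 1) = 1/7
```

An unseen word now contributes a small but nonzero factor instead of silently contributing 1.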
I changed the Counter to a defaultdict so that previously unseen words are handled conveniently, fixed the likelihood computation, and added Laplace smoothing:
import pandas as pd
from collections import defaultdict
data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)
def word_counter(sentence_list):
    word_count = defaultdict(lambda:0)
    for sentence in sentence_list:
        for word in sentence:
            word_count[word] += 1
    return word_count
spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam_total = sum([len(sentence) for sentence in spam])
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham_total = sum([len(sentence) for sentence in ham])
ham = word_counter(ham)
# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)
new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 1
    ham_likelihood = 1
    for word in data:
        spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
        ham_likelihood *= (ham[word] + 1) / (ham_total + 1)
    # marginal likelihood
    marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)
    spam_posterior = (spam_likelihood * spam_prior) / marginal
    print(msg,round(spam_posterior*100,2))
Now the results are as expected:
Spamminess
get free home service 98.04
i live in car 20.65
This can be improved further; for numerical stability, for example, the product of all these probabilities should be replaced by a sum of logarithms.
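A sketch of that log-space improvement (the counts and totals are hard-coded here from the toy dataset for brevity; the log-sum-exp trick keeps the marginal stable even when both likelihoods underflow as plain products):

```python
import math

# Word counts and totals from the toy dataset above
spam_counts = {'free': 2, 'home': 1, 'service': 1, 'get': 1, 'data': 1}
ham_counts = {'we': 1, 'live': 1, 'in': 1, 'a': 1, 'home': 1,
              'i': 1, 'drive': 1, 'the': 1, 'car': 1}
spam_total, ham_total = 6, 9
spam_prior = ham_prior = 0.5

def log_posterior_spam(msg):
    # Sum log-probabilities instead of multiplying raw probabilities
    log_spam = math.log(spam_prior)
    log_ham = math.log(ham_prior)
    for word in msg.split():
        log_spam += math.log((spam_counts.get(word, 0) + 1) / (spam_total + 1))
        log_ham += math.log((ham_counts.get(word, 0) + 1) / (ham_total + 1))
    # Log-sum-exp for the marginal: factor out the max before exponentiating
    m = max(log_spam, log_ham)
    log_marginal = m + math.log(math.exp(log_spam - m) + math.exp(log_ham - m))
    return math.exp(log_spam - log_marginal)

for msg in ["get free home service", "i live in car"]:
    print(msg, round(log_posterior_spam(msg) * 100, 2))
```

This reproduces the 98.04 / 20.65 results above while only exponentiating at the very end.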