Is there any way to extract the Maximum A Posteriori probabilities in scikit-learn Multinomial Naive Bayes, as in the Stanford NLP text?
I am trying to reproduce the results from this link:
https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
The link explains how multinomial Naive Bayes can be used for text classification.
I tried to reproduce the example with scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
#TRAINING SET
dftrain = pd.DataFrame(data=np.array([["Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao", "Tokyo Japan Chinese"],
["yes", "yes", "yes", "no"]]))
dftrain = dftrain.T
dftrain.columns = ['text', 'label']
#TEST SET
dftest = pd.DataFrame(data=np.array([["Chinese Chinese Chinese Tokyo Japan"]]))
dftest.columns = ['text']
count_vectorizer = CountVectorizer(min_df=0, token_pattern=r"\b\w+\b", stop_words=None)
count_train = count_vectorizer.fit_transform(dftrain['text'])
count_test = count_vectorizer.transform(dftest['text'])
clf = MultinomialNB()
clf.fit(count_train, dftrain['label'])
clf.predict(count_test)
The output prints correctly as:
array(['yes'],
dtype='<U3')
Just as the paper says! The paper predicts yes because
P(yes | test set) = 0.0003 > P(no | test set) = 0.0001
I would like to be able to see those two probabilities! When I type:
clf.predict_proba(count_test)
I get:
array([[ 0.31024139, 0.68975861]])
I take this to mean:
P(test belongs to label 'no') = 0.31024139
and P(test belongs to label 'yes') = 0.68975861
So scikit-learn predicts that the text belongs to label yes, but my question is: why are the probabilities different? The paper has P(yes | test set) = 0.0003 > P(no | test set) = 0.0001, yet instead of 0.0003 and 0.0001 I see 0.31024139 and 0.68975861. Am I missing something? Does this have to do with the class_prior parameter?
I did read the documentation:
http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes
Apparently, the parameters are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting.
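As a sanity check, those smoothed estimates can be read directly off the fitted model via its class_log_prior_ and feature_log_prob_ attributes. A minimal sketch, assuming the clf and count_vectorizer fitted above:
# classes_ is sorted alphabetically, so index 0 = 'no', index 1 = 'yes'
print(clf.classes_)                  # ['no' 'yes']
print(np.exp(clf.class_log_prior_))  # [0.25 0.75], i.e. P(no)=1/4, P(yes)=3/4
# smoothed conditionals P(term | class); with the default alpha=1 these
# match the book's hand computations, e.g. P(chinese|yes) = (5+1)/(8+6) = 3/7
cond_probs = np.exp(clf.feature_log_prob_)
for term, col in sorted(count_vectorizer.vocabulary_.items()):
    print(term, cond_probs[:, col])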
What I would like to know is: is there any way I can reproduce and see the numbers from the research paper?
This is more about the meaning of the probabilities that predict_proba produces. The numbers 0.0003 and 0.0001 are not normalized, i.e. they do not sum to one. If you normalize those values, you get the same result. (It is not the class_prior parameter: that argument only lets you override the learned priors, it does not change the normalization.)
See the code snippet below:
In [63]: clf.predict_proba(count_test)
Out[63]: array([[ 0.31024139,  0.68975861]])

# unnormalized joint probability for class 'yes':
# P(yes) * P(Chinese|yes)^3 * P(Tokyo|yes) * P(Japan|yes)
In [64]: p = (3/4)*((3/7)**3)*(1/14)*(1/14)

In [65]: p
Out[65]: 0.00030121377997263036

# unnormalized joint probability for class 'no':
# P(no) * P(Chinese|no)^3 * P(Tokyo|no) * P(Japan|no)
In [66]: p0 = (1/4)*((2/9)**3)*(2/9)*(2/9)

In [67]: p0
Out[67]: 0.00013548070246744223

# normalized values match predict_proba
In [68]: p/(p0+p)
Out[68]: 0.6897586117634674

In [69]: p0/(p0+p)
Out[69]: 0.3102413882365326
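If you want scikit-learn to hand you the unnormalized numbers themselves, they are the exponentials of the joint log-likelihoods log P(c) + sum_t log P(t|c), which predict_proba normalizes internally. A sketch; note that the leading-underscore method below is a private API and may change between versions (newer releases, 1.2+ if I recall correctly, expose the same quantity publicly as predict_joint_log_proba):
jll = clf._joint_log_likelihood(count_test)  # shape (n_samples, n_classes); private API
print(np.exp(jll))  # roughly [[0.00013548, 0.00030121]], i.e. the paper's 0.0001 and 0.0003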