PHP TNTClassifier likelihood probability distribution
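I am using the TNTSearch text classification module, https://github.com/teamtnt/tntsearch, and it works well. The problem is that I do not know how to interpret the results, more specifically the likelihood of a correct match. I have read that it uses a Naive Bayes classifier, but I cannot find out what kind of probability distribution the results follow. I have a small test dataset of my own with about 50 values (50 / 10 = 5 categories), and the guesses are quite accurate.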
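However, the likelihood the tool reports is a negative number, roughly in the range of -15 to -25.
The question is: what value should be interpreted as untrustworthy? Say the tool is only <33% sure. What value does that correspond to?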
I got in touch with the TNTSearch developers. The classifier does not actually return a probability but a "highest score", and only for the best match.
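To make the "score" concrete, here is a short sketch of what it appears to be, assuming that pTotal($type) estimates the class prior P(c) and p($word, $type) estimates P(w | c), as the comments in the predict method suggest. The symbols s_c, w_i and n are my own notation, not something from the library.

```latex
% Score of class c for a statement with stemmed words w_1, ..., w_n
s_c = \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c)

% Normalizing over all classes k (softmax) gives a probability per class
P(c \mid w_1, \dots, w_n) = \frac{e^{s_c}}{\sum_{k} e^{s_k}}
```

Every term of the sum is the logarithm of a number between 0 and 1, so the score is negative and gets more negative with every additional word. That is why a raw value such as -15 or -25 says little on its own: under these assumptions no fixed score corresponds to "33% sure", only the scores relative to the other classes do.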
Following their hints, I made some changes to the code.
In the class TeamTNT\TNTSearch\Classifier\TNTClassifier, I changed part of the predict method (the softmax function was inspired by here):
    public function predict($statement)
    {
        $words = $this->tokenizer->tokenize($statement);

        $best_likelihoods = [];
        $best_likelihood = -INF;
        $best_type = '';
        foreach ($this->types as $type) {
            $likelihood = log($this->pTotal($type)); // log P(Type)
            $p = 0;
            foreach ($words as $word) {
                $word = $this->stemmer->stem($word);
                $p += log($this->p($word, $type)); // add log P(word | Type)
            }
            $likelihood += $p; // log P(Type) + sum of log P(word | Type)

            // Record the score of every type, not only the running best,
            // so that softmax normalizes over all classes.
            $best_likelihoods[$type] = $likelihood;
            if ($likelihood > $best_likelihood) {
                $best_likelihood = $likelihood;
                $best_type = $type;
            }
        }

        return [
            'likelihood' => $best_likelihood,
            'likelihoods' => $best_likelihoods,
            'probability' => $this->softmax($best_likelihoods),
            'label' => $best_type
        ];
    }
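The probability of each class can then be found in $guess['probability'][$label], where $label is the class name (for the best match, $guess['label']).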
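The predict method calls $this->softmax(), which is not shown above. Below is a minimal sketch of what such a helper could look like, written as a standalone function so it can be run on its own; inside TNTClassifier it would be a method instead. The function body, the class names and the scores are made up for illustration and are not part of TNTSearch.

```php
<?php
// Hypothetical helper (not part of TNTSearch): converts per-class log scores
// into probabilities that sum to 1.
function softmax(array $logScores): array
{
    if ($logScores === []) {
        return [];
    }
    // Subtract the largest score before exponentiating. The result is the
    // same mathematically, but exp() no longer underflows to 0 for every
    // class when all scores are very negative (e.g. around -800).
    $max = max($logScores);
    $exps = [];
    foreach ($logScores as $label => $score) {
        $exps[$label] = exp($score - $max);
    }
    $sum = array_sum($exps);
    $probabilities = [];
    foreach ($exps as $label => $value) {
        $probabilities[$label] = $value / $sum;
    }
    return $probabilities;
}

// Made-up log scores for three made-up classes.
$scores = ['sports' => -15.0, 'politics' => -17.0, 'tech' => -25.0];
$probabilities = softmax($scores);

arsort($probabilities);                    // highest probability first
$label = array_key_first($probabilities);  // best class

// Treat the guess as untrustworthy below a chosen threshold, e.g. 0.33.
$trusted = $probabilities[$label] >= 0.33;
```

With these example scores the best class gets a probability of about 0.88, so it passes the 0.33 threshold. Keep in mind that Naive Bayes probabilities tend to be pushed towards 0 or 1, so the threshold is best tuned on your own test data rather than taken literally.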