Kneser-Ney smoothing of trigrams using Python NLTK

Question

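I'm trying to smooth a set of n-gram probabilities with Kneser-Ney smoothing using the Python NLTK. Unfortunately, the whole documentation is rather sparse.

What I'm trying to do is this: I parse a text into a list of trigram tuples. From this list I create a FreqDist and then use that FreqDist to calculate a KN-smoothed distribution.

I'm pretty sure, though, that the result is totally wrong. When I sum up the individual probabilities I get something way beyond 1. Take this code example: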

    import nltk

    ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!")

    freq_dist = nltk.FreqDist(ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)
    prob_sum = 0
    for i in kneser_ney.samples():
        prob_sum += kneser_ney.prob(i)
    print(prob_sum)

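The output is "41.51696428571428". Depending on the corpus size, this value grows without bound. That makes whatever prob() returns anything but a probability distribution.

Looking at the NLTK code, I would say that the implementation is questionable. Maybe I just don't understand how the code is supposed to be used. In that case, could you give me a hint? In any other case: do you know any working Python implementation? I don't really want to implement it myself.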

Answer 1

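Kneser-Ney (also have a look at Goodman and Chen for a great survey of different smoothing techniques) is quite a complicated smoothing method, and only a few packages that I am aware of get it right. I'm not aware of any Python implementation, but you can definitely try SRILM if you just need probabilities, etc.

There is a good chance that your sample contains words that did not occur in the training data (a.k.a. out-of-vocabulary (OOV) words), which, if not handled properly, can mess up the probabilities you get. Perhaps this is what causes the outrageously large and invalid probabilities?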
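One common way to handle OOV words is to map every token outside the training vocabulary to a placeholder before counting n-grams. The following is only a minimal sketch of that idea: the `<UNK>` token and the helper names `build_vocab` and `replace_oov` are hypothetical and not part of NLTK or SRILM.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token, not an NLTK constant


def build_vocab(train_tokens, min_count=1):
    """Collect every training token seen at least min_count times."""
    counts = Counter(train_tokens)
    return {word for word, count in counts.items() if count >= min_count}


def replace_oov(tokens, vocab):
    """Map tokens that are not in the vocabulary to the UNK placeholder."""
    return [token if token in vocab else UNK for token in tokens]


train = "how like an angel how like a god".split()
vocab = build_vocab(train)

mapped = replace_oov("how like a dragon".split(), vocab)
print(mapped)  # ['how', 'like', 'a', '<UNK>']
```

Whether this matters for the code in the question is a separate matter; the sketch only illustrates what "handling OOV words" usually means.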

Answer 2

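Answering your other question:

In any other case: do you know any working Python implementation?

I have just finished a Kneser-Ney implementation in Python. The code is here; there is also a report in the README. Write to me if you have any questions.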

Answer 3

I think you are misunderstanding what Kneser-Ney is computing.

From Wikipedia:

The normalizing constant λ_{w_i-1} has value chosen carefully to make the sum of conditional probabilities p_KN(w_i|w_i-1) equal to one.

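Of course we're talking about bigrams here, but the same principle holds for higher-order models. Basically, what this quote means is that, for a fixed context w_{i-1} (or more context for higher-order models), the probabilities of all w_i must add up to one. What you are doing when you add up the probabilities of all samples is including multiple contexts, which is why you end up with a "probability" greater than 1. If you keep the context fixed, as in the following code sample, you end up with a number <= 1.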



    import nltk
    from nltk.util import ngrams
    from nltk.corpus import gutenberg

    # Word trigrams over every Gutenberg sentence, padded with BOS/EOS markers
    gut_ngrams = (
        ngram
        for sent in gutenberg.sents()
        for ngram in ngrams(sent, 3, pad_left=True, pad_right=True,
                            left_pad_symbol="BOS", right_pad_symbol="EOS")
    )
    freq_dist = nltk.FreqDist(gut_ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)

    # Sum probabilities only over trigrams sharing the context ("I", "confess")
    prob_sum = 0
    for i in kneser_ney.samples():
        if i[0] == "I" and i[1] == "confess":
            prob_sum += kneser_ney.prob(i)
            print("{0}:{1}".format(i, kneser_ney.prob(i)))
    print(prob_sum)

The output, based on a subset of the NLTK Gutenberg corpus, is as follows.



    (u'I', u'confess', u'.--'):0.00657894736842
    (u'I', u'confess', u'what'):0.00657894736842
    (u'I', u'confess', u'myself'):0.00657894736842
    (u'I', u'confess', u'also'):0.00657894736842
    (u'I', u'confess', u'there'):0.00657894736842
    (u'I', u'confess', u',"'):0.0328947368421
    (u'I', u'confess', u'that'):0.164473684211
    (u'I', u'confess', u'"--'):0.00657894736842
    (u'I', u'confess', u'it'):0.0328947368421
    (u'I', u'confess', u';'):0.00657894736842
    (u'I', u'confess', u','):0.269736842105
    (u'I', u'confess', u'I'):0.164473684211
    (u'I', u'confess', u'unto'):0.00657894736842
    (u'I', u'confess', u'is'):0.00657894736842
    0.723684210526

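The reason this sum (.72) is less than 1 is that the probability is calculated only over the trigrams appearing in the corpus whose first word is "I" and whose second word is "confess". The remaining .28 of the probability is reserved for the w_i that do not follow "I" and "confess" in the corpus. This is the whole point of smoothing: to reallocate some probability mass from the ngrams that appear in the corpus to those that don't, so that you don't end up with a bunch of zero-probability ngrams.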
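The printed values can be checked by hand. This is a minimal sketch, assuming each printed probability equals (count - discount) / context_count with the default discount of 0.75 used by nltk.KneserNeyProbDist. The per-trigram counts are not part of the original output; they are inferred from it under that assumption (for example, 0.00657894736842 = 0.25 / 38).

```python
DISCOUNT = 0.75  # default discount of nltk.KneserNeyProbDist

# Counts of each word following ("I", "confess"), inferred from the printed
# probabilities (an assumption, not data taken directly from the corpus).
counts = {
    ".--": 1, "what": 1, "myself": 1, "also": 1, "there": 1,
    ',"': 2, "that": 7, '"--': 1, "it": 2, ";": 1,
    ",": 11, "I": 7, "unto": 1, "is": 1,
}

# Total occurrences of the context ("I", "confess")
context_total = sum(counts.values())

# Mass assigned to trigrams that were actually seen with this context
seen_mass = sum((c - DISCOUNT) / context_total for c in counts.values())

# Mass held back for unseen continuations: one discount per distinct trigram
reserved_mass = DISCOUNT * len(counts) / context_total

print(context_total)            # 38
print(round(seen_mass, 6))      # 0.723684
print(round(reserved_mass, 6))  # 0.276316
```

Under this reading, the seen and reserved mass add up to one for the fixed context, which is the normalization the Wikipedia quote describes.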

Also, doesn't the line



    ngrams = nltk.trigrams("What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!")

compute character trigrams? I think this needs to be tokenized to compute word trigrams.
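The difference is easy to see on a short string. This sketch uses a plain-Python helper named `trigrams` as a stand-in for nltk.trigrams (an assumption made so the example runs without NLTK), and a naive whitespace split in place of a real tokenizer.

```python
def trigrams(sequence):
    """Stand-in for nltk.trigrams: sliding windows of length 3."""
    items = list(sequence)
    return list(zip(items, items[1:], items[2:]))


text = "What a piece of work is man!"

# Iterating over a str yields single characters, so these are character trigrams
char_trigrams = trigrams(text)

# Splitting first yields word trigrams (naive whitespace tokenization)
word_trigrams = trigrams(text.split())

print(char_trigrams[:2])  # [('W', 'h', 'a'), ('h', 'a', 't')]
print(word_trigrams[:2])  # [('What', 'a', 'piece'), ('a', 'piece', 'of')]
```

A proper tokenizer would also separate punctuation such as the "!" from "man", which the whitespace split leaves attached.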
