NLTK package to estimate the (unigram) perplexity
I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am getting the error:
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation on the data I have, and I have generated the unigrams and their respective probabilities (they are normalized, so the total probability over the data sums to 1).
My unigrams and their probabilities look like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have; roughly 1000 more lines follow the same format. The total probabilities (second column) summed give 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify it. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help me on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
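Written out (the standard definition, given here for reference), for a test set W = w1 w2 ... wN of N words:

PP(W) = P(w1 w2 ... wN)^(-1/N) = ( ∏_i 1/P(wi) )^(1/N)

where, for a unigram model, P(w1 w2 ... wN) is simply the product of the individual word probabilities.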
Now, you say that you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume here that you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I can adapt the solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
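For concreteness, here is a minimal sketch of how the snippet above could be hooked up to a unigrams file in the format shown in the question (one "word probability" pair per line); the file name unigrams.txt and the test sentence are placeholders, not part of the original code:

# build the unigram dictionary from a hypothetical "word probability" file
unigram = {}
with open('unigrams.txt') as f:
    for line in f:
        word, prob = line.split()
        unigram[word] = float(prob)

# any tokenized test text will do here
testset = "four yellow Sugar".split()

perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
print perplexity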
UPDATE:
As you asked for a complete working example, here is a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Let's see how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
#here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed: for words outside of its knowledge, it assigns a low probability of 0.01.
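A quick sanity check of that behaviour (a hypothetical session, not part of the original answer; the exact in-vocabulary value depends on the tokenization):

model = unigram(tokens)
print model['Monty']    # in-vocabulary word: roughly its relative frequency in the corpus
print model['zzz_oov']  # out-of-vocabulary word: falls back to the smoothed default 0.01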
I already told you how to compute perplexity:
#computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
You get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to minimise it. A language model that has a smaller perplexity on a particular test set is more desirable than one with a larger perplexity. In the first test set the word Monty is included in the unigram model, so the corresponding perplexity number is also smaller.
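As a back-of-the-envelope check for the second test set (assuming all three words are out of vocabulary and therefore get the 0.01 default): perplexity = ((1/0.01)^3)^(1/3) = 100, which matches the printed 99.99999999999997 up to floating-point error.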
Thanks for the code snippet! Shouldn't:

for word in model:
    model[word] = model[word]/float(sum(model.values()))

rather be:

v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v

so that the sum is not recomputed on every iteration?
Oh... I see it has already been answered...