NLTK FreqDist, plot the normalised counts?
In NLTK, you can easily compute the counts of the words in a text, for example, by doing
from nltk.probability import FreqDist
fd = FreqDist([word for word in text.split()])
where text is a string.
Now you can plot the distribution as
fd.plot()
This gives you a nice line plot with the counts for each word. The docs do not mention a way to plot the actual frequencies instead, which you can see in fd.freq(x).
Is there a straightforward way to plot the normalised counts, without putting the data into another data structure and normalising and plotting separately?
Please forgive the lack of documentation. In nltk, FreqDist gives you the raw counts (i.e. frequencies of words) in the text, while ProbDist gives you the probability of a word given the text.
To find out more, you have to read some of the code: https://github.com/nltk/nltk/blob/develop/nltk/probability.py
The specific line that does the normalization comes from https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L598
So to get a normalized ProbDist, you can do the following:
>>> from nltk.corpus import brown
>>> from nltk.probability import FreqDist
>>> from nltk.probability import DictionaryProbDist
>>> brown_freqdist = FreqDist(brown.words())
# Cast the frequency distribution into probabilities
>>> brown_probdist = DictionaryProbDist(brown_freqdist)
# Something strange in NLTK to note though:
# when asking for probabilities from a ProbDist created without
# normalization, it looks like it returns the count instead...
>>> brown_freqdist['said']
1943
>>> brown_probdist.prob('said')
1943
>>> brown_probdist.logprob('said')
10.924070185585345
>>> brown_probdist = DictionaryProbDist(brown_freqdist, normalize=True)
>>> brown_probdist.logprob('said')
-9.223104921442907
>>> brown_probdist.prob('said')
0.0016732805599763002
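The ProbDist above gives you the probabilities but not a plot. One way to plot them, not covered in the original answer, is a short matplotlib sketch (matplotlib is assumed available; FreqDist.plot() itself depends on it), using the asker's small example text instead of the Brown corpus so it runs without corpus downloads:

```python
import matplotlib.pyplot as plt
from nltk.probability import FreqDist, DictionaryProbDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist(text.split())
pd = DictionaryProbDist(fd, normalize=True)

# Sort the words by count, descending, the same order FreqDist.plot() uses
words = [w for w, _ in fd.most_common()]
probs = [pd.prob(w) for w in words]

plt.plot(probs)
plt.xticks(range(len(words)), words, rotation=90)
plt.ylabel("probability")
plt.show()
```

The FreqDist itself is left untouched here; the probabilities are looked up from the DictionaryProbDist at plot time.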
You can update fd[word] to fd[word] / total:
from nltk.probability import FreqDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist(text.split())
total = fd.N()  # total number of samples
for word in fd:
    fd[word] /= float(total)
fd.plot()
Note: you will lose the original FreqDist values this way.
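If you want to keep the original counts, a simple variant (not in the original answer) is to normalise a copy instead; this relies on FreqDist being a collections.Counter subclass, so FreqDist(fd) copies the counts:

```python
from nltk.probability import FreqDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist(text.split())

# FreqDist subclasses collections.Counter, so this makes a copy
fd_norm = FreqDist(fd)
total = float(fd.N())
for word in fd_norm:
    fd_norm[word] /= total

fd_norm.plot()   # plots the normalized values
print(fd['is'])  # prints 3; the original counts are untouched
```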