Finding a list of words in a corpus using NLTK? Cannot find the frequency of words

I have downloaded the corpus and tokenized the words. I have a list of the main characters, and I want to know how many times each name appears in the corpus. I tried using the frequency function with a dictionary, but I can't work out how to get the counts for just the names.

target_url0 = 'http://www.gutenberg.org/files/135/135-0.txt'
book_raw = urlopen(target_url0).read().decode('utf-8')
word_tokens = word_tokenize(book_raw)

character_list = ['Myriel','Bishop','Baptistine','Magloire','Cravatte','Valjean','Gervais','Fantine','Tholomyès'
                  ,'Blachevelle','Dahlia','Fameuil','Favourite','Listolier','Zéphine','Cosette','Thénardier',
                  'Éponine','Azelma','Javert','Fauchelevent','Bamatabois','Champmathieu',
                  'Brevet','Simplice','Chenildieu','Cochepaille','Innocente','Reverend','Ascension','Crucifixion',
                  'Gavroche','Magnon',
                  'Gillenormand','Marius','Colonel','Mabeuf','Enjolras','Combeferre','Prouvaire',
                 'Feuilly','Courfeyrac','Bahorel','Lesgle','Joly','Grantaire','Patron-Minette','Brujon',
                 'Toussaint'] 


fdist_mis = FreqDist(word_tokens)

filtered_word_freq = dict((character_list, freq) for character_list, freq in fdist_mis.items())

When I explore filtered_word_freq, it just returns all the word tokens, not a dictionary of the unique characters and their occurrences. Any help? Many thanks.

How do you want to view the frequencies? You can get the count of how many times each word was seen, the ratio of its occurrences over the whole text, or even a nicely formatted table. The relevant functions, copied from here:

N()
Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use FreqDist.B().
Return type:    int

freq(sample)
Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist. Frequencies are always real numbers in the range [0, 1].

tabulate(*args, **kwargs)
Tabulate the given samples from the frequency distribution (cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted.
Parameters: samples (list) – The samples to plot (default is all samples)
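To make those three concrete, here is a small sketch on a toy token list (the sample tokens here are made up for illustration, not taken from the book):

```python
import nltk

# A tiny hand-made token list standing in for the real word_tokens.
tokens = ["Cosette", "saw", "Javert", "and", "Javert", "saw", "Cosette", "again"]
fdist = nltk.FreqDist(tokens)

print(fdist.N())             # total number of recorded outcomes: 8
print(fdist["Javert"])       # raw count of one sample: 2
print(fdist.freq("Javert"))  # ratio count / N(): 2 / 8 = 0.25
fdist.tabulate(3)            # prints the 3 most frequent samples as a table
```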

Here is my version of the code (including the import statements and fully qualified calls, for future readers):

import urllib.request
import nltk
nltk.download("punkt")

target_url0 = 'http://www.gutenberg.org/files/135/135-0.txt'
book_raw = urllib.request.urlopen(target_url0).read().decode('utf-8')
word_tokens = nltk.word_tokenize(book_raw)

character_list = ['Myriel','Bishop','Baptistine','Magloire','Cravatte','Valjean','Gervais','Fantine','Tholomyès','Blachevelle','Dahlia','Fameuil','Favourite','Listolier','Zéphine','Cosette','Thénardier','Éponine','Azelma','Javert','Fauchelevent','Bamatabois','Champmathieu','Brevet','Simplice','Chenildieu','Cochepaille','Innocente','Reverend','Ascension','Crucifixion','Gavroche','Magnon','Gillenormand','Marius','Colonel','Mabeuf','Enjolras','Combeferre','Prouvaire','Feuilly','Courfeyrac','Bahorel','Lesgle','Joly','Grantaire','Patron-Minette','Brujon','Toussaint'] 
fdist_mis = nltk.FreqDist(word_tokens)

I can then use that distribution to look up the count of any name. For example:

>>> fdist_mis["Myriel"]
28
>>> fdist_mis["Bishop"]
260
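And to get the full dictionary of character names to counts that the question asked for, restrict the lookup to character_list rather than iterating over every item in the FreqDist. A sketch with toy stand-in data (with the real corpus, reuse the fdist_mis and character_list defined above):

```python
import nltk

# Toy stand-ins for the variables defined earlier in the answer.
word_tokens = ["Cosette", "met", "Javert", "and", "Marius", "met", "Cosette"]
character_list = ["Cosette", "Javert", "Marius", "Myriel"]

fdist_mis = nltk.FreqDist(word_tokens)

# FreqDist returns 0 for samples it has never seen, so names that
# never appear in the tokens simply get a count of 0.
filtered_word_freq = {name: fdist_mis[name] for name in character_list}
print(filtered_word_freq)
# {'Cosette': 2, 'Javert': 1, 'Marius': 1, 'Myriel': 0}
```

Note that this matching is on exact tokens, so possessives like "Cosette's" are tokenized separately and will not be counted toward "Cosette".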