Count ngram word frequency using text collocations
I want to count the frequency of occurrence of the three words before and after a specific word in a text file that has been converted to tokens.
import nltk
from collections import Counter

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)
freq.most_common(20)
I don't know how to search for the string 'dracula' as a filter word. I have also tried:
text.collocations(num=100)
text.concordance('dracula')
The desired output would be counts similar to these:
Three words before 'dracula', with sorted counts:
(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)
Three words after 'dracula', with sorted counts:
(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)
Trigrams with 'dracula' in the middle, with sorted counts:
(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)
Thanks in advance for your help.
Once you have the frequency information in tuple form, as you already do, you can simply filter for the word you're looking for with an if condition. Here it is using Python's list-comprehension syntax:
import nltk
from collections import Counter

# pulled text from here: https://archive.org/details/draculabr00stokuoft/page/n6
with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)

# Filter by the position of 'dracula' within each 4-gram.
dracula_last = [item for item in freq.most_common() if item[0][3] == 'dracula']
dracula_first = [item for item in freq.most_common() if item[0][0] == 'dracula']
dracula_second = [item for item in freq.most_common() if item[0][1] == 'dracula']
# etc.
This produces lists with "dracula" in different positions. Here is what dracula_last looks like:
[(('the', 'castle', 'of', 'dracula'), 3),
(("'s", 'journal', '243', 'dracula'), 1),
(('carpathian', 'moun-', '2', 'dracula'), 1),
(('of', 'the', 'castle', 'dracula'), 1),
(('named', 'by', 'count', 'dracula'), 1),
(('disease', '.', 'count', 'dracula'), 1),
...]
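The same position-based filter covers the trigram case from the question (the target word in the middle): build 3-grams instead of 4-grams and test index 1. A minimal self-contained sketch, using a made-up toy token list in place of the tokenized novel and plain zip in place of nltk.ngrams, so it runs without downloading anything:

```python
from collections import Counter

# Toy token list standing in for nltk.word_tokenize output
# (assumed already lowercased, as in the question).
tokens = ['he', 'saw', 'count', 'dracula', 'saw', 'count', 'dracula',
          'cannot', 'see', 'count', 'dracula', 'saw', 'him']

# Build trigrams by zipping three offset views of the token list.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
freq = Counter(trigrams)

# Keep only trigrams with 'dracula' in the middle, sorted by count.
dracula_middle = [item for item in freq.most_common() if item[0][1] == 'dracula']
print(dracula_middle)
# [(('count', 'dracula', 'saw'), 2), (('count', 'dracula', 'cannot'), 1)]
```

On the real text, swap the toy list for the tokens from the code above; the filter itself is unchanged.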