Accessing elements of a Counter containing ngrams

I'm taking a string, tokenizing it, and want to look at the most common bigrams. Here is what I have:

import nltk
import collections
from nltk import ngrams

someString="this is some text. this is some more test. this is even more text."
tokens=nltk.word_tokenize(someString)
tokens=[token.lower() for token in tokens if len(token)>1]

bigram=ngrams(tokens,2)
aCounter=collections.Counter(bigram)

If I do:

print(aCounter)

then it outputs the bigrams in sorted order.

for element in aCounter:
    print(element)

prints the elements, but without their counts and not in order of count. I want a for loop that prints the X most common bigrams in the text.

I'm essentially trying to learn Python and nltk at the same time, which may be why I'm struggling here (I assume this is something trivial).

You're probably looking for something that already exists, namely the most_common method on Counter objects. From the docs:

Return a list of the n most common elements and their counts from the most common to the least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered arbitrarily:

You can call it with a value n to get the n most common value-count pairs. For example:

from collections import Counter

# initialize with silly value.
c = Counter('aabbbccccdddeeeeefffffffghhhhiiiiiii')

# Print 4 most common values and their respective count.
for val, count in c.most_common(4):
    print("Value {0} -> Count {1}".format(val, count))

This prints:

Value f -> Count 7
Value i -> Count 7
Value e -> Count 5
Value h -> Count 4
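
Applied to the bigram case from the question, the same call might look like the sketch below. To keep it runnable without NLTK or its tokenizer data, it makes two substitutions that are assumptions, not part of the original code: a simple regex stands in for nltk.word_tokenize, and zip(tokens, tokens[1:]) stands in for ngrams(tokens, 2). The names some_string, bigram_counts and top_n are illustrative only.

```python
import re
from collections import Counter

some_string = "this is some text. this is some more test. this is even more text."

# Stand-in tokenizer (assumption): lowercase alphabetic words longer than one character.
# The question uses nltk.word_tokenize plus a len(token) > 1 filter instead.
tokens = [t for t in re.findall(r"[a-z]+", some_string.lower()) if len(t) > 1]

# Pair each token with its successor to form bigrams
# (stand-in for nltk's ngrams(tokens, 2)).
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Print the top_n most common bigrams together with their counts.
top_n = 2
for bigram, count in bigram_counts.most_common(top_n):
    print("{0} -> {1}".format(" ".join(bigram), count))
```

With the original code, the equivalent loop would iterate over aCounter.most_common(X), where each item is a (bigram_tuple, count) pair.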