Accessing elements of a Counter containing n-grams

Question

I am taking a string, tokenizing it, and want to look at the most common bigrams. Here is what I have:

import nltk
import collections
from nltk import ngrams

someString="this is some text. this is some more test. this is even more text."
tokens=nltk.word_tokenize(someString)
tokens=[token.lower() for token in tokens if len(token)>1]

bigram=ngrams(tokens,2)
aCounter=collections.Counter(bigram)

If I do:

print(aCounter)

then it outputs the bigrams in sorted order. However:

for element in aCounter:
    print(element)

prints the elements, but without their counts and not in order of count. I want to write a for loop that prints the X most common bigrams in the text.

I am trying to learn Python and nltk at the same time, which may be why I am struggling here (I assume this is something trivial).

Answer 1

You are probably looking for something that already exists: the most_common method on Counter. From the documentation:

Return a list of the n most common elements and their counts from the most common to the least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered arbitrarily:

You can call it with a value n to get the n most common (value, count) pairs. For example:

from collections import Counter

# initialize with silly value.
c = Counter('aabbbccccdddeeeeefffffffghhhhiiiiiii')

# Print 4 most common values and their respective count.
for val, count in c.most_common(4):
    print("Value {0} -> Count {1}".format(val, count))

This prints:

Value f -> Count 7
Value i -> Count 7
Value e -> Count 5
Value h -> Count 4
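
The same call should work on a Counter whose keys are bigram tuples, as in the question. Below is a minimal sketch of that, with two assumptions so it runs without nltk: a regular expression stands in for nltk.word_tokenize, and zip stands in for nltk's ngrams(tokens, 2). The names some_string, bigram_counts, top_x and top_bigrams are illustrative only.

```python
import re
from collections import Counter

some_string = "this is some text. this is some more test. this is even more text."

# Assumption: a regex replaces nltk.word_tokenize here; it keeps only
# alphabetic tokens, and the filter drops tokens of length 1.
tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", some_string) if len(t) > 1]

# Assumption: zip replaces ngrams(tokens, 2); each key is a (word, next_word) tuple.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# most_common(top_x) returns (bigram, count) pairs, highest count first.
top_x = 2
top_bigrams = bigram_counts.most_common(top_x)
for bigram, count in top_bigrams:
    print("{0} -> {1}".format(" ".join(bigram), count))

# A single bigram's count can also be looked up directly by its tuple key.
this_is_count = bigram_counts[("this", "is")]
```

Swapping the two stand-ins back for nltk.word_tokenize and ngrams should leave the loop unchanged, since each pair yielded by most_common unpacks into the tuple key and its count.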

python

python-3.x

nltk

n-gram