Is there a way to get multiple ngram orders using NLTK without iterating over a generator?

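I need ngrams. I know that nltk.util.ngrams can be used to get them, but the ngrams function actually returns a generator object. I can always iterate over it and store the ngrams in a list. But is there another, more direct way to get these ngrams in a list without iterating over them?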

@georg's comment is spot on.

In [12]: from nltk.util import ngrams

In [13]: g = ngrams([1,2,3,4,5], 3)

In [14]: list(g)
Out[14]: [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

In [15]: g = ngrams([1,2,3,4,5], 3)

In [16]: list(map(lambda x: x, g))  # map() is lazy in Python 3, so wrap it in list()
Out[16]: [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
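Note that the session above rebuilds g before the second call. That is because a generator can only be consumed once. Here is a minimal sketch of that behaviour, using a hypothetical stand-in called simple_ngrams (my own illustration, not NLTK's implementation):

```python
def simple_ngrams(seq, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-tuples."""
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])

g = simple_ngrams([1, 2, 3, 4, 5], 3)
first_pass = list(g)    # consumes the generator
second_pass = list(g)   # nothing left to yield

print(first_pass)   # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(second_pass)  # []
```

So if you need the ngrams more than once, store the list rather than the generator.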

Or without nltk:

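from itertools import chain

def ngrams(L, n=2):
    # Accept either a single order (int) or an iterable of orders.
    orders = [n] if isinstance(n, int) else sorted(n)
    # Zipping k shifted copies of L yields the k-grams; chain joins all orders.
    return list(chain.from_iterable(zip(*[L[i:] for i in range(k)]) for k in orders))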

>>> ngrams([1,2,3,4,5], n = 3)
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
>>> ngrams([1,2,3,4,5], n = [2,3])
[(1, 2), (2, 3), (3, 4), (4, 5), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
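The function above returns one flat list, so the different orders end up mixed together. If you would rather keep them apart, one possible variation is to return a dict keyed by order. This is only a sketch; ngrams_by_order is a hypothetical name I made up for illustration:

```python
def ngrams_by_order(seq, orders):
    """Hypothetical helper: map each order n to the list of n-grams of seq."""
    seq = list(seq)  # allow any iterable, not just lists
    return {
        n: [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
        for n in sorted(set(orders))
    }

grams = ngrams_by_order([1, 2, 3, 4, 5], [2, 3])
print(grams[2])  # [(1, 2), (2, 3), (3, 4), (4, 5)]
print(grams[3])  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```

An order longer than the sequence simply maps to an empty list.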

There is actually a built-in function for getting multiple ngram orders, called everygrams; see https://github.com/nltk/nltk/blob/develop/nltk/util.py#L504

>>> from nltk import everygrams
>>> sent = 'a b c'.split()
# By default, it extracts every possible order of ngrams
# (the ordering of the tuples in the output may differ between NLTK versions).
>>> list(everygrams(sent))
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
# You can set a maximum ngram order.
>>> list(everygrams(sent, max_len=2))
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]
# Or specify a range.
>>> list(everygrams(sent, min_len=2, max_len=3))
[('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]