How to compute skipgrams in Python?
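A k-skipgram is an n-gram that is a superset of all n-grams and of each (k-i)-skipgram until (k-i) == 0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python?
Below is the code I tried, but it does not work as expected: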
<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']

def find_skipgrams(input_list, N, K):
    bigram_list = []
    nlist = []

    K = 1
    for k in range(K+1):
        for i in range(len(input_list)-1):
            if i+k+1 < len(input_list):
                nlist = []
                for j in range(N+1):
                    if i+k+j+1 < len(input_list):
                        nlist.append(input_list[i+k+j+1])
            bigram_list.append(nlist)
    return bigram_list
</pre>
Calling print(find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'], 2, 1))
gives the following output:
[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more',
'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'],
['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]
The code listed here does not give the correct output either:
https://github.com/heaven00/skipgram/blob/master/skipgram.py
print(skipgram_ndarray("What is your name")) gives:
['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']
name is a unigram!
How about using someone else's implementation, https://github.com/heaven00/skipgram/blob/master/skipgram.py , where k = skip_size and n = ngram_order:
import numpy as np

def skipgram_ndarray(sent, k=1, n=2):
    """
    This is not exactly a vectorized version, because we are still
    using a for loop
    """
    tokens = sent.split()
    if len(tokens) < k + 2:
        raise Exception("REQ: length of sentence > skip + 2")
    matrix = np.zeros((len(tokens), k + 2), dtype=object)
    matrix[:, 0] = tokens
    matrix[:, 1] = tokens[1:] + ['']
    result = []
    for skip in range(1, k + 1):
        matrix[:, skip + 1] = tokens[skip + 1:] + [''] * (skip + 1)
    for index in range(1, k + 2):
        temp = matrix[:, 0] + ',' + matrix[:, index]
        # extend() replaces the Python 2 idiom map(result.append, ...),
        # which is lazy and therefore does nothing on Python 3
        result.extend(temp.tolist())
    # Integer division keeps the slice bound an int on Python 3
    limit = (((k + 1) * (k + 2)) // 6) * ((3 * n) - (2 * k) - 6)
    return result[:limit]

def skipgram_list(sent, k=1, n=2):
    """
    Form skipgram features using list comprehensions
    """
    tokens = sent.split()
    tokens_n = ['''tokens[index + j + {0}]'''.format(index)
                for index in range(n - 1)]
    x = '(tokens[index], ' + ', '.join(tokens_n) + ')'
    query_part1 = 'result = [' + x + ' for index in range(len(tokens))'
    query_part2 = ' for j in range(1, k+2) if index + j + n < len(tokens)]'
    # On Python 3, exec() cannot create a local variable in the calling
    # function, so run the generated statement in an explicit namespace
    namespace = {'tokens': tokens, 'k': k, 'n': n}
    exec(query_part1 + query_part2, namespace)
    return namespace['result']
From the paper that the OP links, the following string:
Insurgents killed in ongoing fighting
yields:
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents
ongoing, killed in, killed ongoing, killed fighting, in ongoing, in
fighting, ongoing fighting}
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing,
insurgents killed fighting, insurgents in ongoing, insurgents in
fighting, insurgents ongoing fighting, killed in ongoing, killed in
fighting, killed ongoing fighting, in ongoing fighting}.
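The two sets above can be reproduced with a brute-force sketch. This is an illustration only, not code from the paper or from this answer: the name `skipgrams_by_definition` is made up, and it assumes "k-skip" means that at most k tokens in total are skipped between the first and the last token of the n-gram.

```python
from itertools import combinations

def skipgrams_by_definition(tokens, n, k):
    """Return every n-token tuple that skips at most k tokens in total."""
    result = []
    for positions in combinations(range(len(tokens)), n):
        # Number of tokens left out between the first and last chosen position.
        skipped = positions[-1] - positions[0] - (n - 1)
        if skipped <= k:
            result.append(tuple(tokens[p] for p in positions))
    return result

tokens = "insurgents killed in ongoing fighting".split()
bigrams = skipgrams_by_definition(tokens, n=2, k=2)
trigrams = skipgrams_by_definition(tokens, n=3, k=2)
print(len(bigrams), len(trigrams))  # 9 10
```

It inspects every combination of positions in the sentence, so it is useful as a reference to test against rather than as an efficient implementation.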
With a slight modification to NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):
from itertools import chain, combinations
from nltk.util import ngrams

def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence

def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1:
        return
    # If n+k is longer than the sequence, reduce k by 1 and recur
    elif nk > sequence_length:
        for ng in skipgrams(list(sequence), n, k-1):
            yield ng
        # The sequence is exhausted now. Return explicitly: on Python 3.7+
        # a StopIteration escaping from next() below is a RuntimeError.
        return

    while nk > 1:  # Collects the first instance of n+k length history
        history.append(next(sequence))
        nk -= 1

    # Iteratively drop the first item in history and pick up the next one,
    # yielding skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)
        # Iterate through the rest of the history and
        # pick out all combinations of (n-1)-grams
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skipgrams for the rest of the sequence where
    # len(sequence) < n+k
    for ng in list(skipgrams(history, n, k-1)):
        yield ng
Let's do some doctests to match the example in the paper:
>>> text = "Insurgents killed in ongoing fighting".split()
>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
>>> two_skip_bigrams
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
>>> two_skip_trigrams
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
But do note that if n+k > len(sequence), it yields the same result as skipgrams(sequence, n, k-1) (this is not a bug, it is a fail-safe feature), e.g.
>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams = list(skipgrams(text, n=5, k=4))
>>>
>>> print(len(three_skip_trigrams), three_skip_trigrams)
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print(len(three_skip_fourgrams), three_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fourgrams), four_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fivegrams), four_skip_fivegrams)
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
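A quick sanity check on those lengths, as a sketch that is not part of the original answer: once k is at least len(text) - n, no order-preserving choice of n tokens skips more than k tokens, so the number of skipgrams should equal the binomial coefficient C(len(text), n). It assumes Python 3.8+ for `math.comb`.

```python
from itertools import combinations
from math import comb

text = "Insurgents killed in ongoing fighting".split()

# For each n, count every order-preserving choice of n tokens.
counts = {}
for n in (3, 4, 5):
    counts[n] = len(list(combinations(text, n)))
    assert counts[n] == comb(len(text), n)

print(counts)  # {3: 10, 4: 5, 5: 1}
```

These are the lengths printed above for the trigrams, the fourgrams and the single fivegram.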
This allows n == k but disallows n < k (despite the wording of the error message), as shown in these lines:
if n < k:
    raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")
For the sake of explanation, let's try to understand the "mystical" line:
for idx in list(combinations(range(len(history)), n-1)):
    pass  # Do something
Given a list of unique items, combinations produces this:
>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
And since the indices of a list of tokens are always unique, e.g.
>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0) # i.e. 'this'
>>> list(range(len(sent)))
[0, 1, 2, 3]
it is possible to compute the possible combinations (without replacement) of the range:
>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
If we map the indices back to the list of tokens:
>>> [tuple(sent[i] for i in idx) for idx in combinations(range(len(sent)), 2)]
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
Then we concatenate with the current_token, and we get the skipgrams for the current token and the context + skip window:
>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
So after that we move on to the next word.
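Putting those steps together, here is a minimal sketch of the whole walk. It is an illustration rather than the code from this answer: `skipgrams_windowed` is a hypothetical name, and it assumes the context of each token is simply the next n + k - 1 tokens.

```python
from itertools import combinations

def skipgrams_windowed(tokens, n, k):
    """Yield k-skip-n-grams by visiting the tokens one word at a time."""
    for start, current_token in enumerate(tokens):
        # Context of the current token: up to n - 1 + k following tokens.
        history = tokens[start + 1:start + n + k]
        # Every (n-1)-combination of context indices, mapped back to tokens.
        for idx in combinations(range(len(history)), n - 1):
            yield (current_token,) + tuple(history[i] for i in idx)

sent = ['this', 'is', 'a', 'foo', 'bar']
first_word = [sg for sg in skipgrams_windowed(sent, n=3, k=2) if sg[0] == 'this']
print(first_word)
```

Near the end of the sentence the slice is shorter than n + k - 1, so fewer (or no) combinations are produced and no padding is needed.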
EDITED
The latest NLTK version, 3.2.5, has skipgrams implemented.
Here is a cleaner implementation by @jnothman from the NLTK repo: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L538
def skipgrams(sequence, n, k, **kwargs):
    """
    Returns all possible skipgrams generated from a sequence of items, as an iterator.
    Skipgrams are ngrams that allows tokens to be skipped.
    Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param k: the skip distance
    :type k: int
    :rtype: iter(tuple)
    """

    # Pads the sequence as desired by **kwargs.
    if 'pad_left' in kwargs or 'pad_right' in kwargs:
        sequence = pad_sequence(sequence, n, **kwargs)

    # Note when iterating through the ngrams, the pad_right here is not
    # the **kwargs padding, it's for the algorithm to detect the SENTINEL
    # object on the right pad to stop inner loop.
    SENTINEL = object()
    for ngram in ngrams(sequence, n + k, pad_right=True, right_pad_symbol=SENTINEL):
        head = ngram[:1]
        tail = ngram[1:]
        for skip_tail in combinations(tail, n - 1):
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail
[out]:
>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
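The sentinel idea can be tried without installing NLTK. The sketch below is a stand-in written for illustration, not NLTK's code: `skipgrams_sentinel` is a made-up name, and it assumes n >= 2 and a list of tokens. It also checks the property stated in the question: with k = 0 the result is the plain n-grams, and each k-skip set contains the (k-1)-skip set.

```python
from itertools import combinations

def skipgrams_sentinel(tokens, n, k):
    """Sliding-window skipgrams that use a unique sentinel as right padding."""
    SENTINEL = object()  # a fresh object cannot collide with a real token
    width = n + k
    padded = list(tokens) + [SENTINEL] * (width - 1)
    for start in range(len(tokens)):
        window = padded[start:start + width]
        head, tail = tuple(window[:1]), window[1:]
        for skip_tail in combinations(tail, n - 1):
            # Padding sits only at the right end, so checking the last
            # element rejects every tuple that runs past the sentence.
            if skip_tail[-1] is SENTINEL:
                continue
            yield head + skip_tail

sent = "Insurgents killed in ongoing fighting".split()

# k = 0 gives the ordinary bigrams.
plain_bigrams = list(zip(sent, sent[1:]))
assert list(skipgrams_sentinel(sent, 2, 0)) == plain_bigrams

# Each k-skip set is a superset of the (k-1)-skip set.
for k in range(1, 4):
    smaller = set(skipgrams_sentinel(sent, 2, k - 1))
    larger = set(skipgrams_sentinel(sent, 2, k))
    assert smaller <= larger

print(len(list(skipgrams_sentinel(sent, 2, 2))))  # 9
```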
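Although this departs entirely from your code and defers the work to an external library, you can use Colibri Core (https://proycon.github.io/colibri-core) for skipgram extraction. It is a library written specifically for efficient n-gram and skipgram extraction from big text corpora. The code base is in C++ (for speed/efficiency), but a Python binding is available.
You rightly mention efficiency, as skipgram extraction quickly shows exponential complexity. That may not be a big issue if you only pass a single sentence, as you did in your input_list, but it becomes problematic if you run it on large corpus data. To mitigate this you can set parameters such as an occurrence threshold, or demand that each skip of a skipgram be fillable by at least x distinct n-grams.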
import colibricore

# Prepare corpus data (will be encoded for efficiency)
corpusfile_plaintext = "somecorpus.txt"  # input, one sentence per line
encoder = colibricore.ClassEncoder()
encoder.build(corpusfile_plaintext)
corpusfile = "somecorpus.colibri.dat"  # corpus output
classfile = "somecorpus.colibri.cls"  # class encoding output
encoder.encodefile(corpusfile_plaintext, corpusfile)
encoder.save(classfile)

# Set options for skipgram extraction (mintokens is the occurrence threshold, maxlength the maximum ngram/skipgram length)
options = colibricore.PatternModelOptions(mintokens=2, maxlength=8, doskipgrams=True)

# Instantiate an empty pattern model
model = colibricore.UnindexedPatternModel()

# Train the model on the encoded corpus file (this does the skipgram extraction)
model.train(corpusfile, options)

# Load a decoder so we can view the output
decoder = colibricore.ClassDecoder(classfile)

# Output all skipgrams
for pattern in model:
    if pattern.category() == colibricore.Category.SKIPGRAM:
        print(pattern.tostring(decoder))
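For small inputs, the occurrence-threshold idea can be imitated in plain Python. This is only a sketch of the idea, not Colibri Core's algorithm or API: the helper `skipgrams_windowed`, the toy corpus and the threshold value are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def skipgrams_windowed(tokens, n, k):
    """Yield k-skip-n-grams: each token plus n-1 of the next n-1+k tokens."""
    for start, head in enumerate(tokens):
        window = tokens[start + 1:start + n + k]
        for tail in combinations(window, n - 1):
            yield (head,) + tail

# Toy corpus (hypothetical), one sentence per entry.
corpus = ["the cat sat on the mat", "the cat lay on the mat"]

# Count every 2-skip-bigram across the corpus.
counts = Counter(
    skipgram
    for sentence in corpus
    for skipgram in skipgrams_windowed(sentence.split(), n=2, k=2)
)

# Keep only skipgrams that occur at least min_occurrences times.
min_occurrences = 2
frequent = {sg: c for sg, c in counts.items() if c >= min_occurrences}
print(sorted(frequent))
```

This still generates every skipgram before filtering, so it saves memory in the result only, not extraction time.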
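There is a more extensive Python tutorial about all this on the website.
Disclaimer: I am the author of Colibri Core.
Refer to this for complete info.
The example below is already mentioned there; it shows the usage and works like a charm!
>>> from nltk.util import skipgrams
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))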
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]