Generate ngrams with Julia

To generate word bigrams in Julia, I could simply zip the original list with a copy of the list that has its first element dropped, e.g.:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

To generate trigrams, I can use the same collect(zip(...)) idiom to get:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog") 

But I have to manually add the third list to pass in. Is there an idiomatic way to produce n-grams of any order?

For example, I'd like to avoid doing this to extract 5-grams:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

Here's a clean one-liner for n-grams of any length:

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

It uses a generator expression to iterate over the number of elements `k` to `drop`. Then, with the splat (`...`) operator, it unpacks the `Drop` iterators into `zip`, and finally `collect` turns the `Zip` iterator into an `Array`.
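Note that in current Julia (1.x), the bare `drop` used above no longer exists in Base; it lives in the `Iterators` standard module. A sketch of what I assume is the equivalent modern spelling of the same one-liner:

```julia
# Same idea as above, but qualified for Julia 1.x, where `drop`
# is `Iterators.drop` rather than an exported Base function.
ngram(s, n) = collect(zip((Iterators.drop(s, k) for k = 0:n-1)...))

s = split("the lazy fox jumps over the brown dog")
ngram(s, 3)  # the same 6 trigrams as in the transcript above
```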

julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

As you can see, this is very similar to your solution - it just adds a simple comprehension to iterate over the number of elements to `drop`, so the length can be dynamic.

Another approach is to use `partition()` from Iterators.jl:

ngram(s,n) = collect(partition(s, n, 1))
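For reference, the Iterators.jl functionality has since been succeeded by the IterTools.jl package, which keeps the same three-argument `partition(xs, n, step)`. A sketch, assuming IterTools.jl is installed:

```julia
using IterTools  # successor package to Iterators.jl

# partition(s, n, 1): windows of length n, sliding by 1 element.
ngram(s, n) = collect(IterTools.partition(s, n, 1))

s = split("the lazy fox jumps over the brown dog")
ngram(s, 2)  # 7 tuples, from ("the","lazy") to ("brown","dog")
```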

By slightly changing the output to use `SubArray`s instead of `Tuple`s, almost nothing is lost, but allocation and memory copying are avoided. If the underlying word list is static, this works fine and is faster (it was in my benchmarks). The code:

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]

And the output:

julia> ngram(s,5)
 SubString{String}["the","lazy","fox","jumps","over"] 
 SubString{String}["lazy","fox","jumps","over","the"] 
 SubString{String}["fox","jumps","over","the","brown"]
 SubString{String}["jumps","over","the","brown","dog"]

julia> ngram(s,5)[1][3]
"fox"

For larger word lists, the memory requirement is also greatly reduced.

Also note that using a generator lets you process the n-grams one by one, which is faster and uses even less memory, and may well be sufficient for the downstream processing (counting something, or passing it through a hash). For example, use @Gnimuc's solution without the `collect`, i.e. just `partition(s, n, 1)`.
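As an illustration of that lazy style, here is a sketch (assuming Julia 1.x, where `drop` is `Iterators.drop`) that counts bigram frequencies without ever materializing the array of n-grams:

```julia
s = split("the lazy fox jumps over the brown dog")

# Consume the lazy zip iterator directly: each bigram tuple is
# produced, counted, and discarded, so no n-gram array is allocated.
counts = Dict{Tuple{SubString{String},SubString{String}},Int}()
for g in zip(s, Iterators.drop(s, 1))
    counts[g] = get(counts, g, 0) + 1
end
```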