Generate ngrams with Julia
To generate word bigrams in Julia, I can simply zip the original list with a copy that drops the first element, e.g.:
julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
"the"
"lazy"
"fox"
"jumps"
"over"
"the"
"brown"
"dog"
julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
To generate trigrams, I can use the same collect(zip(...)) idiom:
julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox")
("lazy","fox","jumps")
("fox","jumps","over")
("jumps","over","the")
("over","the","brown")
("the","brown","dog")
But I had to manually add the third list to make this work. Is there an idiomatic way to compute n-grams of any order? For example, I would like to avoid doing this to extract 5-grams:
julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
Here is a clean one-liner for n-grams of any length:
ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))
It uses a generator comprehension to iterate over the number of elements k passed to drop. Then the splat (...) operator unpacks the Drop iterators into zip, and finally collect turns the Zip iterator into an Array.
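A note for current Julia versions: drop is no longer exported from Base in Julia 1.x, so it has to be imported from Base.Iterators. A minimal sketch of the same one-liner under that assumption:

```julia
using Base.Iterators: drop  # in Julia 1.x, drop must be imported explicitly

# Same one-liner as above: zip the list with itself shifted by 0..n-1
ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

s = split("the lazy fox jumps over the brown dog")
ngram(s, 3)  # 6 trigram tuples, starting with ("the", "lazy", "fox")
```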
julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
As you can see, this is very similar to your solution - it just adds a simple comprehension to iterate over the number of elements to drop, so the length can be dynamic.
Another approach is to use partition() from Iterators.jl:
ngram(s,n) = collect(partition(s, n, 1))
By slightly changing the output and using SubArrays instead of Tuples, almost nothing is lost, while allocations and memory copying are avoided. If the underlying word list is static this is fine, and it is faster (in my benchmarks, at least). The code:
ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]
And the output:
julia> ngram(s,5)
SubString{String}["the","lazy","fox","jumps","over"]
SubString{String}["lazy","fox","jumps","over","the"]
SubString{String}["fox","jumps","over","the","brown"]
SubString{String}["jumps","over","the","brown","dog"]
julia> ngram(s,5)[1][3]
"fox"
For larger word lists, the memory requirements are also greatly reduced.
Also note that with a generator you can process the ngrams one by one, faster and with less memory, which may well be enough for the processing you need (counting something, or passing the ngrams to a hash function). For example, use @Gnimuc's solution without collect, i.e. just partition(s, n, 1).
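As a sketch of that lazy, one-pass style (assuming only Base here, with the zip/drop generator standing in for partition(s, n, 1)), you can count bigram frequencies without ever materializing the full list of n-grams:

```julia
using Base.Iterators: drop  # Julia 1.x: drop lives in Base.Iterators

s = split("the lazy fox jumps over the brown dog the lazy fox")

# Iterate the bigrams lazily; nothing is collected into an Array.
counts = Dict{Tuple{SubString{String},SubString{String}},Int}()
for g in zip(s, drop(s, 1))
    counts[g] = get(counts, g, 0) + 1
end

counts[("the", "lazy")]  # 2: the pair occurs twice in the sentence
```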