如何在 Ruby 中的两个字符串中找到相同子序列的索引?
How to find indices of identical sub-sequences in two strings in Ruby?
此处 class DNA
的每个实例都对应一个字符串,例如 'GCCCAC'
。可以从这些字符串构造包含 k-mers 的子字符串数组。对于这个字符串,有 1 聚体、2 聚体、3 聚体、4 聚体、5 聚体和一个 6 聚体:
- 6 个 1 聚体:
["G", "C", "C", "C", "A", "C"]
- 5 个 2 聚体:
["GC", "CC", "CC", "CA", "AC"]
- 4 个 3 聚体:
["GCC", "CCC", "CCA", "CAC"]
- 3 个 4 聚体:
["GCCC", "CCCA", "CCAC"]
- 2 个 5 聚体:
["GCCCA", "CCCAC"]
- 1 个 6 聚体:
["GCCCAC"]
模式应该很明显。有关详细信息,请参阅 Wiki。
问题是编写 DNA class 的方法 shared_kmers(k, dna2) 其中 returns 所有对 [i, j] 的数组,其中此 DNA 对象(接收消息的)与 dna2 共享一个共同的 k-mer,位于该 dna 的位置 i 和 dna2 的位置 j。
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []
class DNA
attr_accessor :sequencing
def initialize(sequencing)
@sequencing = sequencing
end
def kmers(k)
@sequencing.each_char.each_cons(k).map(&:join)
end
def shared_kmers(k, dna)
kmers(k).each_with_object([]).with_index do |(kmer, result), index|
dna.kmers(k).each_with_index do |other_kmer, other_kmer_index|
result << [index, other_kmer_index] if kmer.eql?(other_kmer)
end
end
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.kmers(2)
#=> ["GC", "CC", "CC", "CA", "AC"]
dna2.kmers(2)
#=> ["CC", "CA", "AC", "CG", "GC"]
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []
我只会解决你问题的症结所在,不涉及classDNA
。重组后面的内容应该很容易。
代码
def match_kmers(s1, s2, k)
h1 = dna_to_index(s1, k)
h2 = dna_to_index(s2, k)
h1.flat_map { |k,_| h1[k].product(h2[k] || []) }
end
def dna_to_index(dna, k)
dna.each_char.
with_index.
each_cons(k).
with_object({}) {|arr,h| (h[arr.map(&:first).join] ||= []) << arr.first.last}
end
例子
dna1 = 'GCCCAC'
dna2 = 'CCACGC'
match_kmers(dna1, dna2, 2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
match_kmers(dna2, dna1, 2)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
match_kmers(dna1, dna2, 3)
#=> [[2, 0], [3, 1]]
match_kmers(dna2, dna1, 3)
#=> [[0, 2], [1, 3]]
match_kmers(dna1, dna2, 4)
#=> [[2, 0]]
match_kmers(dna2, dna1, 4)
#=> [[0, 2]]
match_kmers(dna1, dna2, 5)
#=> []
match_kmers(dna2, dna1, 5)
#=> []
match_kmers(dna1, dna2, 6)
#=> []
match_kmers(dna2, dna1, 6)
#=> []
说明
考虑 dna1 = 'GCCCAC'
。这包含 5 个 2-mers (k = 2
):
dna1.each_char.each_cons(2).to_a.map(&:join)
#=> ["GC", "CC", "CC", "CA", "AC"]
同样,对于dna2 = 'CCACGC'
:
dna2.each_char.each_cons(2).to_a.map(&:join)
#=> ["CC", "CA", "AC", "CG", "GC"]
这些是由 dna_to_index
分别为 dna1
和 dna2
生成的哈希键。哈希值是 DNA 字符串中相应键开始位置的索引数组。让我们计算 k = 2
:
的哈希值
h1 = dna_to_index(dna1, 2)
#=> {"GC"=>[0], "CC"=>[1, 2], "CA"=>[3], "AC"=>[4]}
h2 = dna_to_index(dna2, 2)
#=> {"CC"=>[0], "CA"=>[1], "AC"=>[2], "CG"=>[3], "GC"=>[4]}
h1
表明:
"GC"
从 dna1
的索引 0 开始
"CC"
从 dna1
的索引 1 和 2 开始
"CA"
从 dna1
的索引 3 开始
"CC"
从 dna1
的索引 4 开始
h2
也有类似的解释。参见 Enumerable#flat_map and Array#product。
然后使用方法 match_kmers
构造所需的索引对数组 [i, j]
使得 h1[i] = h2[j]
.
现在让我们看看为 3-mers (k = 3
) 生成的哈希:
h1 = dna_to_index(dna1, 3)
#=> {"GCC"=>[0], "CCC"=>[1], "CCA"=>[2], "CAC"=>[3]}
h2 = dna_to_index(dna2, 3)
#=> {"CCA"=>[0], "CAC"=>[1], "ACG"=>[2], "CGC"=>[3]}
我们看到 dna1
中的第一个 3-mer 是 "GCC"
,从索引 0 开始。但是这个 3-mer 没有出现在 dna2
中,所以有返回的数组中没有元素 [0, X]
(X
只是一个占位符)。 "CCC"
也不是第二个哈希中的键。然而,"CCA"
和 "CAC"
出现在第二个散列中,因此返回的数组是:
h1["CCA"].product(h2["CCA"]) + h1["CAC"].product(h2["CAC"])
#=> [[2, 0]] + [[3, 1]]
#=> [[2, 0], [3, 1]]
我将首先编写一个方法来枚举给定长度的子序列(即 k-mers):
class DNA
def initialize(sequence)
@sequence = sequence
end
def each_kmer(length)
return enum_for(:each_kmer, length) unless block_given?
0.upto(@sequence.length - length) { |i| yield @sequence[i, length] }
end
end
DNA.new('GCCCAC').each_kmer(2).to_a
#=> ["GC", "CC", "CC", "CA", "AC"]
在此之上,您可以使用嵌套循环轻松收集相同 k-mers 的索引:
class DNA
# ...
def shared_kmers(length, other)
indices = []
each_kmer(length).with_index do |k, i|
other.each_kmer(length).with_index do |l, j|
indices << [i, j] if k == l
end
end
indices
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
不幸的是,上面的代码为接收器中的每个k-mer遍历了other.each_kmer
。我们可以通过预先构建一个包含 other
中每个 k-mer 的所有索引的哈希来优化它:
class DNA
# ...
def shared_kmers(length, other)
hash = Hash.new { |h, k| h[k] = [] }
other.each_kmer(length).with_index { |k, i| hash[k] << i }
indices = []
each_kmer(length).with_index do |k, i|
hash[k].each { |j| indices << [i, j] }
end
indices
end
end
此处 class DNA
的每个实例都对应一个字符串,例如 'GCCCAC'
。可以从这些字符串构造包含 k-mers 的子字符串数组。对于这个字符串,有 1 聚体、2 聚体、3 聚体、4 聚体、5 聚体和一个 6 聚体:
- 6 个 1 聚体:
["G", "C", "C", "C", "A", "C"]
- 5 个 2 聚体:
["GC", "CC", "CC", "CA", "AC"]
- 4 个 3 聚体:
["GCC", "CCC", "CCA", "CAC"]
- 3 个 4 聚体:
["GCCC", "CCCA", "CCAC"]
- 2 个 5 聚体:
["GCCCA", "CCCAC"]
- 1 个 6 聚体:
["GCCCAC"]
模式应该很明显。有关详细信息,请参阅 Wiki。
问题是编写 DNA class 的方法 shared_kmers(k, dna2) 其中 returns 所有对 [i, j] 的数组,其中此 DNA 对象(接收消息的)与 dna2 共享一个共同的 k-mer,位于该 dna 的位置 i 和 dna2 的位置 j。
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []
class DNA
attr_accessor :sequencing
def initialize(sequencing)
@sequencing = sequencing
end
def kmers(k)
@sequencing.each_char.each_cons(k).map(&:join)
end
def shared_kmers(k, dna)
kmers(k).each_with_object([]).with_index do |(kmer, result), index|
dna.kmers(k).each_with_index do |other_kmer, other_kmer_index|
result << [index, other_kmer_index] if kmer.eql?(other_kmer)
end
end
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.kmers(2)
#=> ["GC", "CC", "CC", "CA", "AC"]
dna2.kmers(2)
#=> ["CC", "CA", "AC", "CG", "GC"]
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []
我只会解决你问题的症结所在,不涉及classDNA
。重组后面的内容应该很容易。
代码
def match_kmers(s1, s2, k)
h1 = dna_to_index(s1, k)
h2 = dna_to_index(s2, k)
h1.flat_map { |k,_| h1[k].product(h2[k] || []) }
end
def dna_to_index(dna, k)
dna.each_char.
with_index.
each_cons(k).
with_object({}) {|arr,h| (h[arr.map(&:first).join] ||= []) << arr.first.last}
end
例子
dna1 = 'GCCCAC'
dna2 = 'CCACGC'
match_kmers(dna1, dna2, 2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
match_kmers(dna2, dna1, 2)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
match_kmers(dna1, dna2, 3)
#=> [[2, 0], [3, 1]]
match_kmers(dna2, dna1, 3)
#=> [[0, 2], [1, 3]]
match_kmers(dna1, dna2, 4)
#=> [[2, 0]]
match_kmers(dna2, dna1, 4)
#=> [[0, 2]]
match_kmers(dna1, dna2, 5)
#=> []
match_kmers(dna2, dna1, 5)
#=> []
match_kmers(dna1, dna2, 6)
#=> []
match_kmers(dna2, dna1, 6)
#=> []
说明
考虑 dna1 = 'GCCCAC'
。这包含 5 个 2-mers (k = 2
):
dna1.each_char.each_cons(2).to_a.map(&:join)
#=> ["GC", "CC", "CC", "CA", "AC"]
同样,对于dna2 = 'CCACGC'
:
dna2.each_char.each_cons(2).to_a.map(&:join)
#=> ["CC", "CA", "AC", "CG", "GC"]
这些是由 dna_to_index
分别为 dna1
和 dna2
生成的哈希键。哈希值是 DNA 字符串中相应键开始位置的索引数组。让我们计算 k = 2
:
h1 = dna_to_index(dna1, 2)
#=> {"GC"=>[0], "CC"=>[1, 2], "CA"=>[3], "AC"=>[4]}
h2 = dna_to_index(dna2, 2)
#=> {"CC"=>[0], "CA"=>[1], "AC"=>[2], "CG"=>[3], "GC"=>[4]}
h1
表明:
"GC"
从dna1
的索引 0 开始
"CC"
从dna1
的索引 1 和 2 开始
"CA"
从dna1
的索引 3 开始
"CC"
从dna1
的索引 4 开始
h2
也有类似的解释。参见 Enumerable#flat_map and Array#product。
然后使用方法 match_kmers
构造所需的索引对数组 [i, j]
使得 h1[i] = h2[j]
.
现在让我们看看为 3-mers (k = 3
) 生成的哈希:
h1 = dna_to_index(dna1, 3)
#=> {"GCC"=>[0], "CCC"=>[1], "CCA"=>[2], "CAC"=>[3]}
h2 = dna_to_index(dna2, 3)
#=> {"CCA"=>[0], "CAC"=>[1], "ACG"=>[2], "CGC"=>[3]}
我们看到 dna1
中的第一个 3-mer 是 "GCC"
,从索引 0 开始。但是这个 3-mer 没有出现在 dna2
中,所以有返回的数组中没有元素 [0, X]
(X
只是一个占位符)。 "CCC"
也不是第二个哈希中的键。然而,"CCA"
和 "CAC"
出现在第二个散列中,因此返回的数组是:
h1["CCA"].product(h2["CCA"]) + h1["CAC"].product(h2["CAC"])
#=> [[2, 0]] + [[3, 1]]
#=> [[2, 0], [3, 1]]
我将首先编写一个方法来枚举给定长度的子序列(即 k-mers):
class DNA
def initialize(sequence)
@sequence = sequence
end
def each_kmer(length)
return enum_for(:each_kmer, length) unless block_given?
0.upto(@sequence.length - length) { |i| yield @sequence[i, length] }
end
end
DNA.new('GCCCAC').each_kmer(2).to_a
#=> ["GC", "CC", "CC", "CA", "AC"]
在此之上,您可以使用嵌套循环轻松收集相同 k-mers 的索引:
class DNA
# ...
def shared_kmers(length, other)
indices = []
each_kmer(length).with_index do |k, i|
other.each_kmer(length).with_index do |l, j|
indices << [i, j] if k == l
end
end
indices
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
不幸的是,上面的代码为接收器中的每个k-mer遍历了other.each_kmer
。我们可以通过预先构建一个包含 other
中每个 k-mer 的所有索引的哈希来优化它:
class DNA
# ...
def shared_kmers(length, other)
hash = Hash.new { |h, k| h[k] = [] }
other.each_kmer(length).with_index { |k, i| hash[k] << i }
indices = []
each_kmer(length).with_index do |k, i|
hash[k].each { |j| indices << [i, j] }
end
indices
end
end