Find matching items between two arrays with difficult match condition

We have two lists of events. Each event has an id, a start_time and a start_time_range; start_time_range sets a tolerance around start_time so that near-miss events can be found.

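The objective is to filter current_matches so that it contains only the items that matched previously. An item "appears" in the list if the id matches and the start_time falls within the range.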

To achieve this I have the loop below, but it is getting very slow as our data volume keeps growing. I need to optimise it:

current_matches.select! do |match_row|
  previous_matches_collection.any? do |previous_match|
    previous_match[:item_id] == match_row[:item_id] &&
      previous_match[:start_time_range].include?(match_row[:start_time].to_f)
  end
end

If I only needed to match on item_id, I could do this:

previous_ids = previous_matches_collection.collect{|i| i[:item_id] }
current_matches.select! do |match_row|
  previous_ids.include?(match_row[:item_id])
end

But I can't see a way to use that approach when the time condition also has to be matched for each item.

In terms of data, current_matches can have 300 items and previous_matches_collection can have 1k+. Is there a way to do this without iterating over 300,000 combinations?

Edit - sample data:

previous_matches_collection = [
  { item_id: 1, start_time: 1597094395.1195982, start_time_range: (1597094393.6195982..1597094396.6195982) },
  { item_id: 1, start_time: 1597095083.116646, start_time_range: (1597095081.616646..1597095084.616646) },
  { item_id: 1, start_time: 1597095403.028223, start_time_range: (1597095401.528223..1597095404.528223) },
  { item_id: 2, start_time: 1597098035.056944, start_time_range: (1597098033.556944..1597098036.556944) },
  { item_id: 3, start_time: 1597096073.4109557, start_time_range: (1597096071.9109557..1597096074.9109557) },
  { item_id: 4, start_time: 1597094785.6987526, start_time_range: (1597094784.1987526..1597094787.1987526) },
  { item_id: 4, start_time: 1597098077.41271, start_time_range: (1597098075.91271..1597098078.91271) }
]


current_matches = [
  { item_id: 1, start_time: 1597094395.9195982 },
  { item_id: 1, start_time: 1597095085.116646, },
  { item_id: 1, start_time: 1597095404.228223, },
  { item_id: 2, start_time: 1597094395.1195982 },
  { item_id: 4, start_time: 1597094395.1195982 },
  { item_id: 6, start_time: 1597094395.1195982 },
  { item_id: 17, start_time: 1597094395.1195982 }
]

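Just create a Hash that maps each previous match to the timestamp at which it started.

Then, for each current_match, do a fetch to get the timestamp if one exists, and test whether that timestamp satisfies the condition.

If previous_matches_collection has 1000 things and current_matches has 300, that is 1300 hash operations, each of which is O(1). This should scale better than your current solution.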
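The description above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's own code: it uses made-up, trimmed-down data in the shape of the question's sample, and because one item_id can have several previous matches it stores an array of ranges per id rather than a single timestamp. The names ranges_by_id and previously_matched are invented for the example.

```ruby
# Made-up data in the same shape as the question's sample.
previous_matches_collection = [
  { item_id: 1, start_time_range: (10.0..13.0) },
  { item_id: 1, start_time_range: (20.0..23.0) },
  { item_id: 2, start_time_range: (30.0..33.0) }
]
current_matches = [
  { item_id: 1, start_time: 11.5 },  # inside the first range for id 1
  { item_id: 1, start_time: 15.0 },  # known id, but outside every range
  { item_id: 2, start_time: 11.5 },  # known id, wrong time
  { item_id: 9, start_time: 11.5 }   # id never seen before
]

# One pass over the previous matches: item_id => [range, range, ...]
ranges_by_id = {}
previous_matches_collection.each do |prev|
  (ranges_by_id[prev[:item_id]] ||= []) << prev[:start_time_range]
end

# One hash lookup per current match; fetch falls back to [] for unknown ids.
previously_matched = current_matches.select do |match|
  ranges_by_id.fetch(match[:item_id], []).any? do |range|
    range.cover?(match[:start_time])
  end
end

p previously_matched
```

Only the first element of current_matches survives the filter. The inner any? scans just the ranges recorded for that one id, which is normally a small fraction of the whole collection.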

h = previous_matches_collection.each_with_object({}) do |g,h|
  id = g[:item_id]
  h[id] = (h[id] || []) << g[:start_time_range]
end
  #=> {1=>[1597094393.6195982..1597094396.6195982,
  #        1597095081.616646..1597095084.616646,
  #        1597095401.528223..1597095404.528223],
  #    2=>[1597098033.556944..1597098036.556944],
  #    3=>[1597096071.9109557..1597096074.9109557],
  #    4=>[1597094784.1987526..1597094787.1987526,
  #        1597098075.91271..1597098078.91271]}   
current_matches.select do |g|
  id = g[:item_id]
  h.key?(id) && h[id].any? { |a| a.cover?(g[:start_time]) }
end
  #=> [{:item_id=>1, :start_time=>1597094395.919598},
  #    {:item_id=>1, :start_time=>1597095404.228223}] 

See Range#cover? and Enumerable#any?.

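In the first expression, if h has no key id = g[:item_id], then (h[id] || []) evaluates to (nil || []) => [], g[:start_time_range] is appended to that new array, and the result is assigned to h[id]. If the key already exists, the range is simply appended to the existing array. One could alternatively write: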

h = previous_matches_collection.
    each_with_object(Hash.new { |h,k| h[k] = [] }) do |g,h|
  h[g[:item_id]] << g[:start_time_range]
end

This makes h an initially-empty hash with a default proc: if h[k] is evaluated when h has no key k, the proc executes h[k] = []. See the third form of Hash::new.
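A small sketch of how such a default proc behaves, using a throwaway hash rather than the real data. One thing worth noticing is that, with this form, merely reading a missing key stores an empty array under it, so key? or fetch is the safer way to probe for ids afterwards.

```ruby
h = Hash.new { |hash, key| hash[key] = [] }

h[:a] << 1            # no :a key yet, so the block runs and stores []
h[:a] << 2            # :a exists now, so the block is not called

p h.keys              # only :a so far
h[:missing]           # a plain read also triggers the block...
p h.keys              # ...so :missing is now present with an empty array

p h.key?(:other)      # key? does not trigger the block
p h.fetch(:other, []) # neither does fetch with an explicit default
p h.keys              # :other was never added
```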

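A simple optimization is to avoid using any? to find the correct id. Instead, build a lookup hash so that all previous_matches_collection elements with the correct id can be fetched in O(1).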

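Another optimization is to use cover? instead of include?. The difference is that cover? only compares the element with the begin and end of the range, whereas include?, for non-numeric ranges like the one below, calls succ (successor, e.g. 1.succ #=> 2) on the begin element to generate a collection, which it then searches for the element.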

("a".."z").include?("cc") #=> false
# is similar to:
# ["a", "b", "c", ..., "x", "y", "z"].include?("cc") #=> false

("a".."z").cover?("cc") #=> true
# is similar to:
# "a" <= "cc" && "cc" <= "z" #=> true

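The code block above demonstrates the difference between the two. In your scenario you only want to know whether the value lies within the range, so cover? is the better fit and, where the two differ, the faster choice.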
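For the Float ranges in the question, the two methods should give the same answers: the Ruby documentation for Range#include? says it behaves like cover? when both endpoints are numeric. A quick check, using one range and two timestamps taken from the sample data:

```ruby
range = (1597094393.6195982..1597094396.6195982)

inside  = 1597094395.9195982  # start_time of the first current match
outside = 1597095085.116646   # start_time of the second current match

p range.cover?(inside)     # true
p range.include?(inside)   # true
p range.cover?(outside)    # false
p range.include?(outside)  # false
```

So for this particular data the switch to cover? is mostly about stating the intent clearly; the bulk of the saving is likely to come from the hash lookup.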

start_time_ranges_by_item_id = previous_matches_collection
  .group_by { |match| match[:item_id] }
  .transform_values { |matches| matches.map { |match| match[:start_time_range] } }
start_time_ranges_by_item_id.default = []

Now that the start_time_ranges_by_item_id hash is built, we should be able to jump directly to the relevant ranges and start checking from there.

current_matches.select! do |match_row|
  item_id, start_time = match_row.values_at(:item_id, :start_time)
  start_time_ranges = start_time_ranges_by_item_id[item_id]
  start_time_ranges.any? { |range| range.cover?(start_time) }
end
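One way to gain some confidence in a rewrite like this is to run it against the original nested loop on generated data and compare the results. The sketch below does that with seeded random data; the method names naive_filter and indexed_filter, the id ranges and the 1.5-second tolerance are assumptions made for the example, and both methods return a new array instead of mutating their input.

```ruby
# Hypothetical helper: the original O(n * m) nested loop.
def naive_filter(current, previous)
  current.select do |row|
    previous.any? do |prev|
      prev[:item_id] == row[:item_id] &&
        prev[:start_time_range].cover?(row[:start_time])
    end
  end
end

# Hypothetical helper: the hash-lookup version.
def indexed_filter(current, previous)
  ranges_by_id = previous
    .group_by { |prev| prev[:item_id] }
    .transform_values { |prevs| prevs.map { |prev| prev[:start_time_range] } }
  current.select do |row|
    ranges_by_id.fetch(row[:item_id], []).any? { |r| r.cover?(row[:start_time]) }
  end
end

# Randomised data, seeded so the run is repeatable.
rng = Random.new(42)
previous = Array.new(1_000) do
  t = rng.rand(0.0..10_000.0)
  { item_id: rng.rand(1..50), start_time: t, start_time_range: ((t - 1.5)..(t + 1.5)) }
end
current = Array.new(300) do
  { item_id: rng.rand(1..60), start_time: rng.rand(0.0..10_000.0) }
end

# Both filters apply the same condition, so this should print true.
puts naive_filter(current, previous) == indexed_filter(current, previous)
```

The same two methods can be wrapped in Benchmark.bm from the standard library to measure the difference on real data sizes.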