Group records for data analysis in Rails

Question

I have two tables associated through a habtm relationship (via a join table).

Table1
id  : integer
name: string

Table2
id  : integer
name: string

Table3
id       : integer
table1_id: integer
table2_id: integer

I need to group the records of Table1 by the Table2 records they have in common. Example:

userx = Table1.create()
user1.table2_ids = 3, 14, 15
user2.table2_ids = 3, 14, 15, 16
user3.table2_ids = 3, 14, 16
user4.table2_ids = 2, 5, 7
user5.table2_ids = 3, 5

The grouping I want looks like this:

=> [ [ [1,2], [3, 14, 15] ], [ [2,3], [3,14, 16] ], [ [ 1, 2, 3, 5], [3] ]  ]

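In each entry, the first array holds the user IDs and the second holds the table2_ids they share. Is there an SQL solution for this, or do I need to write some kind of algorithm?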

Update: OK, I have code that works the way I described. Anyone who wants to help may find it useful for understanding my idea.

def self.compare
  hash = {}
  Table1.find_each do |table_record|
    Table1.find_each do |another_table_record|
      if table_record != another_table_record
        # table2_ids shared by the two records
        results = table_record.table2_ids & another_table_record.table2_ids
        hash["#{table_record.id}_#{another_table_record.id}"] = results unless results.empty?
      end
    end
  end
  #hash = hash.delete_if{|k,v| v.empty?}
  hash.sort_by { |k, v| v.count }.to_h
end

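But I bet you can imagine how long it takes to show me the output. For my 500 Table1 records it takes about 1-2 minutes. With more records the time will keep growing, so I need a more elegant solution or an SQL query.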

Answer 1

Table1.find_each do |table_record|
  Table1.find_each do |another_table_record|
    ...

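The code above has a performance problem: it has to query the database on the order of N*N times. This can be reduced to a single query.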

# Query table3, constructing the data useful to us
# { table1_id: [table2_ids], ... }
records = Table3.all.group_by { |t| t.table1_id }.map { |t1_id, t3_records|
    [t1_id, t3_records.map(&:table2_id)]
  }.to_h

You can then perform exactly the same operation on records to get the final result hash.
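As a minimal sketch of the shape of `records`, the snippet below builds the same `{ table1_id => [table2_ids] }` hash from plain `[table1_id, table2_id]` pairs. The `pairs` array and the `build_records` helper are hypothetical names used only for illustration; the data mirrors the example in the question. In a Rails app the pairs could presumably come from something like `Table3.pluck(:table1_id, :table2_id)`, which would avoid instantiating a model object per row, but that is an assumption and not part of the answer above.

```ruby
# Stand-in for the rows of Table3 as [table1_id, table2_id] pairs
# (hypothetical data taken from the example in the question).
pairs = [
  [1, 3], [1, 14], [1, 15],
  [2, 3], [2, 14], [2, 15], [2, 16],
  [3, 3], [3, 14], [3, 16],
  [4, 2], [4, 5], [4, 7],
  [5, 3], [5, 5]
]

# Build { table1_id => [table2_ids] } in a single pass over the pairs.
def build_records(pairs)
  pairs.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(t1_id, t2_id), acc|
    acc[t1_id] << t2_id
  end
end

records = build_records(pairs)
```

Everything after this point works on an in-memory hash, so the pairwise comparison no longer touches the database.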

Update:

@AKovtunov, to clarify what I meant: my code is only the first step. With records holding the {t1_id: t2_ids} hash, you can do this:

hash = {}
records.each do |t1_id, t2_ids|
  records.each do |tt1_id, tt2_ids|
    if t1_id != tt1_id
      inter = t2_ids & tt2_ids
      hash["#{t1_id}_#{tt1_id}"] = inter if !inter.empty?
    end
  end
end
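The loop above visits every pair twice (once as `a_b` and once as `b_a`) and keys the result by pair. A possible variation, sketched here under the assumption that the goal is the `[[user_ids], [table2_ids]]` layout from the question, is to visit each pair once with `Array#combination` and collect user IDs under the set of IDs they share. The `group_by_shared_ids` name and the sample `records` hash are hypothetical, for illustration only.

```ruby
# Sample data in the { table1_id => [table2_ids] } shape (from the question).
records = {
  1 => [3, 14, 15],
  2 => [3, 14, 15, 16],
  3 => [3, 14, 16],
  4 => [2, 5, 7],
  5 => [3, 5]
}

# Returns [[table1_ids], [shared table2_ids]] entries,
# ordered by the size of the shared set, largest first.
def group_by_shared_ids(records)
  groups = Hash.new { |h, k| h[k] = [] }

  # combination(2) yields each unordered pair exactly once.
  records.to_a.combination(2) do |(id_a, ids_a), (id_b, ids_b)|
    shared = (ids_a & ids_b).sort
    next if shared.empty?

    # Collect every table1 id whose pairwise intersection is this exact set.
    groups[shared] |= [id_a, id_b]
  end

  groups.map { |shared, ids| [ids.sort, shared] }
        .sort_by { |ids, shared| [-shared.size, ids] }
end

result = group_by_shared_ids(records)
```

For the sample data, `result` contains the three entries listed in the question, plus entries for every other non-empty pairwise intersection (here `[[1, 3], [3, 14]]` and `[[4, 5], [5]]`). The number of comparisons is still quadratic in the number of Table1 records, but they all run in memory.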

Tags: ruby, database, ruby-on-rails, data-analysis