SQL like: How to calculate intersection and union of <item,user> data

Need help with SQL:

I have data with the following columns: ItemId and UserId.

Each row indicates that an item was purchased by a user. Example:

ItemId  UserId
   200  user1
   200  user3
   200  user4
   300  user5
   300  user3

For each pair of items <i, j> I want to compute the following output table:

Example output (from the example above):

i_itemId  j_itemId  users(i)  users(j)  users(i,j)  users(i,~j)  users(~i,j)
     200       200         3         3           3            0            0
     200       300         3         2           1            2            1
     300       300         2         2           2            0            0
     300       200         2         3           1            1            2

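Notes

  1. The data table lives in the cloud and is big (11 GB). I have an SQL framework to work with, so I can't download the file and run Python (for example); the solution must be written in SQL.
  2. The solution doesn't have to be a single SQL statement.
  3. I'm looking for an efficient solution.
  4. We can assume that <item,user> is a key.
  5. If anyone has a better suggestion for the question title, I'll happily update it :)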

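I'm not sure there is an "easy" way to do this. One approach is rather brute force: use a cross join to generate all pairs of items, then use a correlated subquery for each individual count: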

select i1.itemid as itemid_1, i2.itemid as itemid_2, i1.num as cnt1, i2.num as cnt2,
       (select count(*)
        from t u1 join
             t u2
             on u1.userid = u2.userid
        where u1.itemid = i1.itemid and u2.itemid = i2.itemid
       ) as cnt_1_2,
       (select count(*)
        from t u1 left join
             t u2
             on u1.userid = u2.userid and u2.itemid = i2.itemid
        where u1.itemid = i1.itemid and u2.itemid is null
       ) as cnt_1_not2,
       -- start from item 2's buyers and keep those with no matching row for item 1
       (select count(*)
        from t u2 left join
             t u1
             on u1.userid = u2.userid and u1.itemid = i1.itemid
        where u2.itemid = i2.itemid and u1.itemid is null
       ) as cnt_not1_2
from (select itemid, count(*) as num from t group by itemid) i1 cross join
     (select itemid, count(*) as num from t group by itemid) i2;

Here's a recipe.

1) Create a temporary table to collect the totals for I and J.

Disclaimer:
This example uses an MS SQL Server datatype: INT.
So change it to a numeric type that your RDBMS supports.
By the way, in MS SQL Server the names of temporary tables start with #.

create table TempTotals (iItemId int, jItemId int, TotalUsers int); 

2) Fill in the totals

delete from TempTotals;
insert into TempTotals (iItemId, jItemId, TotalUsers)
select 
    t1.ItemId as iItemId, 
    t2.ItemId as jItemId, 
    count(distinct t1.UserId) as TotalUsers
from YourTable t1
-- an inner self-join is enough here: every row matches itself on UserId
join YourTable t2 on (t1.UserId = t2.UserId)
group by t1.ItemId, t2.ItemId;

3) Self-join the temporary table to get all the totals

select 
 ij.iItemId, 
 ij.jItemId,
 i.TotalUsers as Users_I,
 j.TotalUsers as Users_J,
 ij.TotalUsers as Users_I_and_J, 
 (i.TotalUsers - ij.TotalUsers) as Users_I_no_J,
 (j.TotalUsers - ij.TotalUsers) as Users_J_no_I
from TempTotals ij
left join TempTotals i on (i.iItemId = ij.iItemId and i.iItemId = i.jItemId)
left join TempTotals j on (j.jItemId = ij.jItemId and j.iItemId = j.jItemId);
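
The three steps can also be folded into one statement, sketched below in generic SQL. It assumes the source table is named t with columns ItemId and UserId, and that each <ItemId, UserId> pair is unique; the CTE names item_totals and pair_totals are made up for illustration. Because a self-join on UserId only yields pairs of items that share at least one buyer, the sketch cross joins the per-item totals and left joins the pair counts, so pairs with no common users still appear with an intersection of 0.

```sql
-- Assumption: t(ItemId, UserId) with (ItemId, UserId) unique.
with item_totals as (
  -- number of buyers per item
  select ItemId, count(*) as num_users
  from t
  group by ItemId
),
pair_totals as (
  -- number of buyers shared by each pair of items (only pairs with >= 1 shared buyer)
  select a.ItemId as i_item, b.ItemId as j_item, count(*) as num_shared
  from t a
  join t b on a.UserId = b.UserId
  group by a.ItemId, b.ItemId
)
select i.ItemId                              as i_itemId,
       j.ItemId                              as j_itemId,
       i.num_users                           as users_i,
       j.num_users                           as users_j,
       coalesce(p.num_shared, 0)             as users_i_and_j,
       i.num_users - coalesce(p.num_shared, 0) as users_i_not_j,
       j.num_users - coalesce(p.num_shared, 0) as users_j_not_i
from item_totals i
cross join item_totals j
left join pair_totals p
  on p.i_item = i.ItemId and p.j_item = j.ItemId
order by i.ItemId, j.ItemId;
```

On the five sample rows from the question this should return the four item pairs with the counts shown in the example output. The self-join in pair_totals is the expensive part on an 11 GB table, so whether this is fast enough depends on the engine and on how many items each user bought.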

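If you're using Oracle Database, you can compare nested tables (collections) with the multiset operators, and get the number of elements in a collection with cardinality.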

So what you can do is:

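  • Group by itemid, collecting all the users into a nested table
  • Cross join the output of this with itself
  • Use the multiset intersect/except operators to get the number of elements in each set as needed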

Which looks a bit like:

create table t (
  ItemId int, UserId varchar2(10)
);
insert into t values (   200  ,  'user1');
insert into t values (   200  ,  'user3');
insert into t values (   200  ,  'user4');
insert into t values (   300  ,  'user5');
insert into t values (   300  ,  'user3');

commit;

create or replace type users_t as table of varchar2(10);
/

with grps as (
  select itemid, cast ( collect ( userid ) as users_t ) users
  from   t
  group  by itemid
)
  select g1.itemid i, g2.itemid j,
         cardinality ( g1.users ) num_i,
         cardinality ( g2.users ) num_j,
         cardinality ( g1.users multiset intersect g2.users ) i_and_j,
         cardinality ( g1.users multiset except g2.users ) i_not_j,
         cardinality ( g2.users multiset except g1.users ) j_not_i
  from   grps g1
  cross  join grps g2;

I     J     NUM_I   NUM_J   I_AND_J   I_NOT_J   J_NOT_I   
  200   200       3       3         3         0         0 
  200   300       3       2         1         2         1 
  300   200       2       3         1         1         2 
  300   300       2       2         2         0         0 

If necessary, you may be able to get better performance by skipping the except operator when i = j, e.g.:

case 
  when g1.itemid = g2.itemid then 0 
  else cardinality ( g1.users multiset except g2.users )
end
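
For context, here is a sketch of how that shortcut could be applied to both difference columns of the query above. It reuses the table t and the users_t type already defined, and is an untimed illustration rather than a measured optimization.

```sql
with grps as (
  select itemid, cast ( collect ( userid ) as users_t ) users
  from   t
  group  by itemid
)
  select g1.itemid i, g2.itemid j,
         cardinality ( g1.users ) num_i,
         cardinality ( g2.users ) num_j,
         cardinality ( g1.users multiset intersect g2.users ) i_and_j,
         -- a set minus itself is empty, so skip the except when i = j
         case
           when g1.itemid = g2.itemid then 0
           else cardinality ( g1.users multiset except g2.users )
         end i_not_j,
         case
           when g1.itemid = g2.itemid then 0
           else cardinality ( g2.users multiset except g1.users )
         end j_not_i
  from   grps g1
  cross  join grps g2;
```

The saving is limited to the rows where i = j, so it is likely to matter only when the collections per item are large.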