Create Apriori itemset on Redshift

Question

My goal is to use the Apriori algorithm to find interesting insights in a purchase table created on AWS Redshift. The purchase table looks like this:

-------------
ID | product
1    A
1    B
1    C
2    A
2    C

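I am able to calculate the frequency of each product and filter out the observations with low frequency. However, I am struggling to create the itemsets in the AWS Redshift environment. This is what I am trying to get: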

------------------
itemset | count(*)
A,B       1
A,C       2
B,C       1

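There are more than 1000 products in the purchase table, so I would like to learn how to write an effective and efficient query to solve this problem. Thanks.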

Answer 1

Use a self-join:

select t1.product, t2.product, count(*)
from t t1 join
     t t2
     on t1.id = t2.id and t1.product < t2.product
group by t1.product, t2.product;

This puts the itemset in two columns. You can also concatenate them together:

select t1.product || ',' || t2.product, count(*)
from t t1 join
     t t2
     on t1.id = t2.id and t1.product < t2.product
group by t1.product, t2.product
order by t1.product || ',' || t2.product;

Here is a SQL Fiddle showing that the code works.
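Since the question mentions filtering out low-frequency products first, the two steps could be combined in one query. What follows is only a sketch under assumptions: the table is still called t, the support threshold of 2 is a made-up value to tune for your data, and the CTE names frequent and filtered are just illustrative. It also deduplicates (id, product) rows, which matters only if the same product can appear more than once in a purchase.

```sql
-- Sketch only: table name t and the threshold 2 are assumptions for illustration.
with frequent as (
    -- products that appear in at least 2 distinct purchases
    select product
    from t
    group by product
    having count(distinct id) >= 2
),
filtered as (
    -- one row per (id, product), restricted to frequent products,
    -- so repeated rows do not inflate the pair counts
    select distinct t.id, t.product
    from t join
         frequent f
         on t.product = f.product
)
select f1.product || ',' || f2.product as itemset, count(*) as cnt
from filtered f1 join
     filtered f2
     on f1.id = f2.id and f1.product < f2.product
group by f1.product, f2.product
having count(*) >= 2
order by itemset;
```

With the sample data from the question and a threshold of 2, B is dropped in the first step, so only the pair A,C (count 2) remains. Shrinking the table before the self-join should help with 1000+ products, because the join only pairs up products that can still be part of a frequent itemset.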

sql

apriori

amazon-redshift