Problems while extracting association rules with Orange?

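I have a dataset with shape (878049, 6).

It looks like this:

I would like to extract association rules that link the Category column with the other columns. Following the documentation, I tried the following with Orange-Associate: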

In:

import Orange
data = Orange.data.Table("data.csv")

In:

data.domain.attributes

Out:

(DiscreteVariable('Category', values=['ARSON', 'ASSAULT', 'BAD CHECKS', 'BRIBERY', 'BURGLARY', ...]),
 DiscreteVariable('Descript', values=['ABANDONMENT OF CHILD', 'ABORTION', 'ACCESS CARD INFORMATION, PUBLICATION OF', 'ACCESS CARD INFORMATION, THEFT OF', 'ACCIDENTAL BURNS', ...]),
 DiscreteVariable('DayOfWeek', values=['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', ...]),
 DiscreteVariable('PdDistrict', values=['BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN', ...]),
 DiscreteVariable('Resolution', values=['ARREST, BOOKED', 'ARREST, CITED', 'CLEARED-CONTACT JUVENILE FOR MORE INFO', 'COMPLAINANT REFUSES TO PROSECUTE', 'DISTRICT ATTORNEY REFUSES TO PROSECUTE', ...]))

In:

from orangecontrib.associate.fpgrowth import *  

X, mapping = OneHot.encode(data, include_class=True)

X

Out:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ..., 
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]], dtype=bool)

In:

 sorted(mapping.items())

Out:

[(0, (0, 0)),
 (1, (0, 1)),
 (2, (0, 2)),
 (3, (0, 3)),
 (4, (0, 4)),
 (5, (0, 5)),
 (6, (0, 6)),
 (7, (0, 7)),
....
 (950, (4, 15)),
 (951, (4, 16))]

Then:

In:

itemsets = dict(frequent_itemsets(X, .4))

len(itemsets)

Out:

1 

In:

class_items = {item
               for item, var, _ in OneHot.decode(mapping, data, mapping)
               if var is data.domain.class_var}

In:

sorted(class_items)

Out:

[]

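I think the problem is that I am not generating the Orange table correctly. How should I load the dataset with Orange in order to generate association rules?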

Update

Following @K3---rnc's answer, I tried this:

itemsets = dict(frequent_itemsets(X, .1))
print (len(itemsets))
print( itemsets)
for itemset, _support in itemsets:
    print(' '.join('{}={}'.format(var.name, val)
                   for _, var, val in OneHot.decode(itemset, data, mapping)))

18
{frozenset({935}): 206403, frozenset({20}): 92304, frozenset({928}): 119908, frozenset({924}): 129211, frozenset({946}): 526790, frozenset({921}): 116707, frozenset({946, 932}): 93924, frozenset({919}): 121584, frozenset({932}): 157182, frozenset({21}): 126182, frozenset({922}): 125038, frozenset({16}): 174900, frozenset({929}): 105296, frozenset({918}): 133734, frozenset({16, 946}): 156586, frozenset({925}): 89431, frozenset({923}): 124965, frozenset({920}): 126810}

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-83a24c082126> in <module>()
      2 print (len(itemsets))
      3 print( itemsets)
----> 4 for itemset, _support in itemsets:
      5     print(' '.join('{}={}'.format(var.name, val)
      6                    for _, var, val in OneHot.decode(itemset, data, mapping)))

ValueError: not enough values to unpack (expected 2, got 1)

However, I still have the same problem: I cannot extract association rules.
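The ValueError in the traceback above does not come from Orange. Iterating over a dict directly yields only its keys, so each frozenset key gets unpacked into `itemset, _support`, and a one-element key fails. A minimal sketch in plain Python, using three of the itemsets printed above, shows the difference that `.items()` makes:

```python
# Three entries copied from the printed itemsets dict above.
itemsets = {
    frozenset({935}): 206403,
    frozenset({946}): 526790,
    frozenset({946, 932}): 93924,
}

# Iterating over the dict yields keys only; unpacking a one-element
# frozenset into two names raises ValueError.
try:
    for itemset, _support in itemsets:
        pass
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Iterating over .items() yields (key, value) pairs, which unpack cleanly.
pairs = [(sorted(itemset), support) for itemset, support in itemsets.items()]
print(pairs)
```

With `.items()` each iteration gives the itemset and its absolute support count, which is the form that a call such as `OneHot.decode(itemset, data, mapping)` expects for its first argument.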

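You are trying to induce classification rules without any class variable in the data domain. If you print data.domain, you will see that you only have regular attributes and metas.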

[Category, DayOfWeek, PdDistrict, Resolution] {Descript, Address}
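This also seems to explain the empty `class_items` set in the question: as far as I can tell from the Orange documentation, `data.domain.class_var` is None when the domain has no class variable, so the identity test `var is data.domain.class_var` never succeeds. A minimal sketch with stand-in objects (the `Var` class, the `class_items_for` helper, and the item index 39 are hypothetical, and nothing here imports Orange):

```python
# Stand-in for an Orange variable; hypothetical, for illustration only.
class Var:
    def __init__(self, name):
        self.name = name

category, day = Var("Category"), Var("DayOfWeek")

# (item index, variable, value) triples, shaped like what OneHot.decode yields.
# Index 39 is an illustrative placeholder, not taken from the real mapping.
decoded = [(0, category, "ARSON"), (1, category, "ASSAULT"), (39, day, "Friday")]

def class_items_for(decoded, class_var):
    """Collect the item indices whose variable is the class variable."""
    return {item for item, var, _ in decoded if var is class_var}

# Domain without a class variable: class_var is assumed to be None.
no_class = class_items_for(decoded, None)
# Domain with 'Category' set as the class variable.
with_class = class_items_for(decoded, category)

print(sorted(no_class))
print(sorted(with_class))
```

Once a class variable is set, the same comprehension picks up every one-hot column that belongs to it.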

To fix this, you need to set one of the attributes as the class variable.

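# Build a domain with 'Category' as the class variable instead of a regular attribute
new_domain = Orange.data.Domain(list(data.domain.attributes[1:]),
                                data.domain.attributes[0],
                                metas=data.domain.metas)
# Convert the table to the new domain so that data.domain.class_var is set
data = Orange.data.Table.from_table(new_domain, data)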

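This sets the 'Category' attribute as the class variable. Of course, you can follow the example above to choose a different class variable. If you now print new_domain, you should see something like this: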

[DayOfWeek, PdDistrict, Resolution | Category] {Descript, Address}

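After converting the table to the new domain and re-running OneHot.encode on it (so that X and mapping match the new column order), you can look at what the found itemsets contain: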

# Minimum 20% support. Decrease for more results
itemsets = dict(frequent_itemsets(X, .2))

for itemset, _support in itemsets.items():
    print(' '.join('{}={}'.format(var.name, val)
                   for _, var, val in OneHot.decode(itemset, data, mapping)))

This will print:

Category=ASSAULT DayOfWeek=Friday ...

or whichever itemsets meet the minimum support threshold (20% in the code above).
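To see why a high minimum support returns so few itemsets, the absolute counts printed in the question can be converted to relative support. The sketch below is plain Python and assumes 878049 rows (the dataset shape given in the question) plus four counts copied from the printed output. `relative_support`, `passing`, and `confidence` are hypothetical helper names, not part of Orange:

```python
# Number of rows, from the dataset shape (878049, 6) given in the question.
N_ROWS = 878049

# A few absolute support counts copied from the printed itemsets dict.
counts = {
    frozenset({946}): 526790,
    frozenset({935}): 206403,
    frozenset({16}): 174900,
    frozenset({16, 946}): 156586,
}

def relative_support(count, n_rows=N_ROWS):
    """Fraction of rows that contain the itemset."""
    return count / n_rows

def passing(counts, min_support):
    """Itemsets whose relative support reaches min_support."""
    return [sorted(itemset) for itemset, count in counts.items()
            if relative_support(count) >= min_support]

at_40 = passing(counts, 0.4)
at_20 = passing(counts, 0.2)
print(at_40)
print(at_20)

def confidence(antecedent, consequent, counts):
    """confidence(A -> C) = support(A and C together) / support(A)."""
    return counts[antecedent | consequent] / counts[antecedent]

# Confidence of the rule {16} -> {946}, from the counts above.
conf_16_946 = confidence(frozenset({16}), frozenset({946}), counts)
print(round(conf_16_946, 3))
```

Only item 946 reaches 40% among these counts, which is consistent with the single itemset the question got at `.4`. Lowering the threshold admits more itemsets, and pairs such as {16, 946} are what rules are built from.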