How to make the apriori algorithm faster?
TL;DR question
This is a long question, so here is a TL;DR version:
I am implementing the apriori algorithm, and it is slow. The slow part is when I try to generate Lk from Ck, where the algorithm has to scan the whole dataset (more or less) to find out whether each candidate is frequent. How can I make this part faster? Is there a data structure that speeds up these repeated searches over the dataset?
Long question
I have an assignment to write the apriori algorithm in Python. The constraints are not to use pandas and not to use any module that implements apriori, such as apyori. Generating C1 and L1 is not a problem at all, and generating Ck from Lk-1 works fine too. The bottleneck of the whole thing is generating Lk from Ck: that part compares every candidate against the whole dataset, which takes ages for small minimum supports.
I spent some time looking for improved versions of apriori, and of the ones I could understand, this one seemed the best (IMO): https://arxiv.org/pdf/1403.3948.pdf
The paper proposes keeping, for each item/itemset, a list of the transactions it appears in (let's call it found_in). With that, we can reduce the number of searches when generating Lk from Ck, because we only have to scan the transactions mentioned in that list (found_in).
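For context, here is a minimal sketch of that idea in isolation (my own toy code; the name tid_lists is mine, not the paper's). The support of an itemset becomes the size of the intersection of its items' TID sets, with no pass over the dataset:

from collections import defaultdict

def build_tid_lists(dataset):
    # map each item to the set of transaction IDs (TIDs) it occurs in
    tid_lists = defaultdict(set)
    for tid, transaction in enumerate(dataset):
        for item in transaction:
            tid_lists[item].add(tid)
    return tid_lists

dataset = [[1, 2, 3], [3, 4, 1], [3, 4, 5]]
tid_lists = build_tid_lists(dataset)
# support of {3, 4} = |TIDs(3) & TIDs(4)| = |{1, 2}| = 2
support = len(tid_lists[3] & tid_lists[4])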
I implemented it, and it cut the running time by a factor of about 4, which is great. However, it is still not fast enough for the assignment, since I am supposed to extract 40,000 frequent patterns.
So I am wondering whether the algorithm itself is fine but the Python data structures I am using are too slow to keep up. So here are my questions:
- Can I make this algorithm faster by using better data structures, maybe with ctypes?
- Is there anything wrong with my code that makes it hang? The results of this algorithm look reasonable to me (compared with the output of apyori).
- Any tips on improving it, or pitfalls it is prone to?
I know it would take a long time to investigate this question properly, and I am not expecting that. So any small hints would be appreciated.
The slow part of the code:
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    subset_time = 0
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        tt = time()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID])
            # this is the slowest part of the code and this is how to make it a
            # bit faster
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
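As a point of comparison, here is a hedged sketch of the same function with the transactions converted to sets up front (gen_Lk_sets and dataset_sets are my names, not part of the assignment code). The subset test then costs one hash lookup per candidate item instead of a linear scan of the transaction list:

def gen_Lk_sets(Ck: dict, dataset_sets: list, min_support_count: int) -> dict:
    # expects dataset_sets = [frozenset(t) for t in dataset], built once
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        cand = frozenset(candidate)
        # <= on frozensets is issubset: len(cand) hash lookups per transaction
        new_TIDs = {TID for TID in TIDs if cand <= dataset_sets[TID]}
        if len(new_TIDs) >= min_support_count:
            Lk[candidate] = new_TIDs
    return Lk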
Full code (sorry for the bad comments):
from itertools import combinations
import pickle
from time import time


def has_infrequent_subset(candidate: set, previous_L: list) -> bool:
    """
    A function to prune some of the candidates.

    Parameters
    ----------
    candidate -- a set to check whether all of its subsets are frequent or not.
        if any subset is not frequent, the function returns True,
        otherwise it returns False.
    previous_L -- a list of tuples to check the candidate's subsets against.
        an instance of previous_L can be found in the 'Examples' part.

    Returns
    -------
    a boolean value. True means some subsets of the candidate are not
    frequent with respect to previous_L and this candidate should not be
    included in the final Ck result. False means all subsets are frequent and
    we shall include this candidate in our Ck result.

    Examples
    --------
    >>> previous_L = [(1,2,4),(2,3,6),(2,3,8),(2,6,7),(2,6,8),(3,4,5),(3,6,8)]
    >>> has_infrequent_subset((2,3,6,8), previous_L)
    False
    >>> has_infrequent_subset((2,3,6,7), previous_L)
    True
    """
    subsets = combinations(candidate, len(candidate)-1)
    for subset in subsets:  # subset is a tuple
        if subset not in previous_L:
            return True
    return False
def apriori_gen(previous_L: dict) -> dict:
    """
    A function to generate candidates with respect to Lk-1 (previous_L). It
    tries to prune the results with the help of has_infrequent_subset(): for
    every new candidate found, if any of its subsets of length k-1 is not
    frequent in Lk-1 (previous_L), the candidate is not added to the result.
    """
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                new_TIDs = TIDs1 if len(TIDs1) < len(TIDs2) else TIDs2
                Ck[new_item] = new_TIDs
    return Ck
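For a quick sanity check of apriori_gen, here is a toy input of my own (with TID sets shaped like the ones generate_L1 produces):

L2 = {(1, 2): {0, 1}, (1, 3): {0, 2}, (2, 3): {0}}
print(apriori_gen(L2))
# {(1, 2, 3): {0, 2}} -- (1,2) and (1,3) share the prefix (1,), every
# 2-subset of (1,2,3) is frequent, and the shorter of the two parents'
# TID sets is kept (TIDs2 on a tie)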
def generate_L1(dataset: list, min_support_count: int) -> dict:
    """
    Generates the L1 itemsets from the given dataset with respect to
    min_support_count.

    Parameters
    ----------
    dataset -- a list of lists. each inner list represents a transaction whose
        content is the items bought in that transaction. the outer list is the
        dataset, which contains all transactions.
    min_support_count -- an integer used to check whether an item is
        frequent or not.

    Returns
    -------
    a dictionary whose keys are the frequent L1 items found and whose values
    record which transactions each item appeared in. the values are sets.
    the values will be useful later, as this paper demonstrates:
    https://arxiv.org/pdf/1403.3948.pdf

    Examples
    --------
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 3)
    {(3,): {0, 1, 2}}
    >>> generate_L1([[1,2,3], [3,4,1], [3,4,5]], 2)
    {(1,): {0, 1}, (3,): {0, 1, 2}, (4,): {1, 2}}
    """
    L1 = {}
    for TID, transaction in enumerate(dataset):
        for item in transaction:
            if (item,) not in L1:
                L1[(item,)] = set()
            L1[(item,)].add(TID)
    return {item: TIDs for item, TIDs in L1.items()
            if len(TIDs) >= min_support_count}
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    st = time()
    Lk = {}
    for candidate, TIDs in Ck.items():
        if len(TIDs) < min_support_count:
            continue
        new_TIDs = set()
        tt = time()
        for TID in TIDs:
            comp_transaction = dataset[TID]
            # equivalent to: if candidate.issubset(dataset[TID])
            # this is the slowest part of the code and this is how to make it a
            # bit faster
            if not any(item not in comp_transaction for item in candidate):
                new_TIDs.add(TID)
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
def apriori(min_support_count: int, dataset: list):
    st = time()
    L1 = generate_L1(dataset, min_support_count)
    L = {1: L1}
    for k in range(2, 1000):
        if len(L[k-1]) < 2:
            break
        Ck = apriori_gen(L[k-1])
        L[k] = gen_Lk(Ck, dataset, min_support_count)
    return L
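If you want to run the pipeline without the pickled file, a toy invocation (hypothetical data of mine, not the assignment's dataset) could look like this:

dataset = [[1, 2, 3], [1, 2, 4], [1, 2], [2, 3]]
L = apriori(2, dataset)
for k, Lk in L.items():
    print(k, Lk)
# level 1 keeps (1,), (2,), (3,); level 2 keeps (1, 2) and (2, 3);
# level 3 is empty, so the loop stops there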
if __name__ == '__main__':
    with open(paths.q1.listed_ds, 'rb') as f:
        dataset = pickle.load(f)
    L = apriori(len(dataset)*0.03, dataset)
    result = []
    for _, Lk in L.items():
        result.extend(Lk.keys())
    print(result, len(result))
Probably a bit late for your assignment, but:
Compute the new TIDs already in apriori_gen:
new_TIDs = TIDs1.intersection(TIDs2)
and then reuse the new TIDs like this:
def gen_Lk(Ck: dict, dataset: list, min_support_count: int) -> dict:
    # dataset is kept only to preserve the original signature; it is
    # no longer scanned here
    Lk = {}
    for candidate, new_TIDs in Ck.items():
        if len(new_TIDs) < min_support_count:
            continue
        Lk[candidate] = new_TIDs
    return Lk
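Put together, the apriori_gen side of that suggestion would look like this (my sketch of the combined change, not the answer's verbatim code). The intersection of the two parents' TID sets is exactly the TID set of the joined itemset, so gen_Lk no longer needs to touch the dataset at all:

def apriori_gen(previous_L: dict) -> dict:
    Ck = {}
    for item_1, TIDs1 in previous_L.items():
        for item_2, TIDs2 in previous_L.items():
            if item_1[:-1] == item_2[:-1] and item_1[-1] < item_2[-1]:
                new_item = tuple([*item_1, item_2[-1]])
                if has_infrequent_subset(new_item, previous_L):
                    continue
                # a transaction contains new_item iff it contains both
                # parents, so this intersection is new_item's exact TID set
                Ck[new_item] = TIDs1.intersection(TIDs2)
    return Ck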