Parallel-processing efficient_apriori code in Python
I have 12 million records from an online shop and I want to compute association rules with the efficient_apriori package. The problem is that 12 million observations are too many, so the computation takes far too long. Is there a way to speed up the algorithm? I am thinking about some form of parallel processing, or compiling the Python code to C. I tried PyPy, but PyPy does not support the pandas package. Thanks for any help or ideas.
In case you want to see my code:
import pandas as pd
from efficient_apriori import apriori

orders = pd.read_csv("orders.csv", sep=";")
# One transaction per customer: a tuple of all item names that customer ordered
customer = orders.groupby("id_customer")["name"].agg(tuple).tolist()
itemsets, rules = apriori(
    customer, min_support=100/len(customer), min_confidence=0
)
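For reference, apriori returns the itemsets as a dictionary keyed by itemset size, with the count of each itemset, and the rules as a list; so the result looks roughly like this (illustrative values, not my real data):

print(itemsets[1])  # e.g. {('bread',): 4021, ('milk',): 3577, ...}
print(rules[0])     # e.g. {bread} -> {milk}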
Could I use an approach like the following to run the task in parallel:
from multiprocessing import Pool

MIN_SUPPORT = 100 / len(customer)  # same threshold as above
MIN_CONFIDENCE = 0

length_of_input_file = len(customer)
total_offset_count = 4  # number of parallel processes to run
offset = length_of_input_file // total_offset_count

# Split the transactions into four roughly equal chunks.
# Slice end indices are exclusive, so no -1 is needed; the last
# chunk takes the remainder when the length is not divisible by 4.
dataNew1 = customer[0:offset]
dataNew2 = customer[offset:2 * offset]
dataNew3 = customer[2 * offset:3 * offset]
dataNew4 = customer[3 * offset:]

def calculate_frequent_itemset(fractional_data):
    """Compute the frequent itemsets and rules for one chunk of the data."""
    itemsets, rules = apriori(fractional_data, min_support=MIN_SUPPORT,
                              min_confidence=MIN_CONFIDENCE)
    return itemsets, rules

p = Pool(total_offset_count)
frequent_itemsets = p.map(calculate_frequent_itemset,
                          (dataNew1, dataNew2, dataNew3, dataNew4))
p.close()
p.join()

itemsets1, rules1 = frequent_itemsets[0]
itemsets2, rules2 = frequent_itemsets[1]
itemsets3, rules3 = frequent_itemsets[2]
itemsets4, rules4 = frequent_itemsets[3]
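Afterwards I would merge the partial results, something like the rough sketch below (my assumption: the itemsets returned by efficient_apriori are dicts mapping itemset size to {itemset: count}, as shown above). I am aware that the counts are computed per chunk, so an itemset that is frequent overall could still be dropped by a chunk where it falls below min_support:

from collections import defaultdict

# Sketch: sum the per-chunk counts for every itemset.
merged_counts = defaultdict(lambda: defaultdict(int))
for chunk_itemsets, _chunk_rules in frequent_itemsets:
    for size, counts in chunk_itemsets.items():
        for itemset, count in counts.items():
            merged_counts[size][itemset] += count

# Keep only itemsets meeting the global threshold of 100 customers
# (min_support=100/len(customer) corresponds to an absolute count of 100).
global_itemsets = {
    size: {iset: c for iset, c in counts.items() if c >= 100}
    for size, counts in merged_counts.items()
}

Would this give correct results, or is there a better way to parallelize it?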