Maximum Pattern Length fpGrowth (Apache) PySpark
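I am trying to run association rules with PySpark. I first build an FPGrowth tree and pass it to the association rules method.
However, I would like to add a maximum pattern length parameter to limit the number of items shown on the left-hand and right-hand sides. For associations between items, I only want to keep the pattern length at 2.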
## fit model
from pyspark.ml.fpm import FPGrowth
fpGrowth_1 = FPGrowth(itemsCol="collect_set(title_name)", minSupport=.001, minConfidence=0.001)
model_working_1 = fpGrowth_1.fit(transactions_2)
## Display frequent itemsets.
model_working_1.freqItemsets.show()
+--------------------+------+
|               items|  freq|
+--------------------+------+
|[Temptation Islan...|325291|
| [Temptation Island]|282205|
|[Temptation Islan...|175694|
| [S4 - Engl progr...|171400|
|      [Nieuwe Buren]|168684|
|[Neighboursss, Te...|113113|
|       [Love Island]|146766|
|[Love Island, S4 ...| 65285|
|[Love Island, Tem...|105834|
|[Love Island, Tem...| 83335|
|[Love Island, Tem...|115979|
|[Good Time Sle......|132439|
+--------------------+------+
# Display generated association rules.
model_working_1.associationRules.show()
+--------------------+--------------------+------------------+
|          antecedent|          consequent|        confidence|
+--------------------+--------------------+------------------+
|[Love Island, Tem...| [Temptation Island]|0.7185352520714957|
|[De Beste Verleid...|[Temptation Islan...|0.9147820487266372|
|     [Bella Donna's]|[Temptation Islan...|  0.74988107580655|
|[Binnenkort bij V...|[Temptation Islan...|0.9756179956817415|
|[Married at First...| [Temptation Island]|0.8692627446452283|
|       [Love Island]| [Temptation Island]|0.7211070683945873|
|       [Love Island]|[Temptation Islan...|0.7902307073845442|
|[S4 - Dutch progr...| [Temptation Island]|  0.61975495915986|
|[S4 - Dutch progr...|[Temptation Islan...|0.7550758459743291|
|[The Good Doctor,...| [Temptation Island]| 0.873575189492565|
+--------------------+--------------------+------------------+
# transform examines the input items against all the association rules and summarizes the
# consequents as the prediction
model_working_1.transform(transactions_2).show()
+---------------------+----------------------------------------------------------------------------------------------+
| title_name | Prediction |
+---------------------+----------------------------------------------------------------------------------------------+
|[Goode Time Bad ....| Temptation Island VIPS,S4 - Dutch program viewer,Weg van Jou |
The Good Doctor,Moordvrouw,De 12 van Oldenheim,Married at First Sight,Dave en Dien op Ibiza,Temptation Gossip] |
|[S4 - Englis progr...|Lara Croft Tomb Raider, Ronald Goedemondt - Geen Sp
|[Goede Tijden Sl.........|[I Love You Tattoo, S7 - Dutch suspense-series viewer, Temptation Island VIPS, Awkward, Goede Tijden Slechte Tijden, Lost, De Beste Verleiders, Cellblock H]|
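The generated association rules are very long patterns. I would like to keep the length at 2 items, or slightly more. Right now there is too much to interpret or understand.
Is there any way to limit the pattern length in PySpark? I found a link for Scala, but there is nothing like it for PySpark.
I would appreciate any suggestions or help with this. Thanks in advance!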
In PySpark, you can try:
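from pyspark.sql.functions import col, size
# `model` is the fitted FPGrowthModel (model_working_1 in the question).
# Keep only rules with one item on each side, i.e. a total pattern length of 2.
model.associationRules.where(size(col('antecedent')) == 1).where(size(col('consequent')) == 1).show()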