How to find a percentile-based threshold when there is data skew, using Spark?
I have a dataset and am trying to find a threshold on similarity_score (how similar a variation is to the original) that I can use to filter out less relevant variations. The higher the similarity_score, the more similar.
+--------+----------------------+---------------------+
|original|variation |similarity_score |
+--------+----------------------+---------------------+
|pencils |pencils |0.982669767327994 |
|pencils |pencils ticonderoga |0.8875609147113148 |
|pencils |pencils bulk |0.5536876549778099 |
|pencils |pencils for kids |0.39102614317876977 |
|pencils |mechanical pencils |0.3837525124443511 |
|pencils |school supplies |0.36800207529412093 |
|pencils |pencils mechanical |0.20423055289450207 |
|pencils |black pencils |0.08241323295822053 |
|pencils |erasers |0.08101804016695552 |
|pencils |papermate pencils |0.08091683972455012 |
|pencils |pensils for kids |0.07299289422964994 |
|pencils |loose leaf paper |0.07113136587149338 |
|pencils |pencil sharpener |0.0684130629091813 |
|pencils |pencils with sayings |0.06350472916694737 |
|pencils |cute pencils |0.058316002229988215 |
|pencils |colored pencils |0.05552992878175486 |
|pencils |pensils sharpener |0.0491689934641725 |
|pencils |pens |0.048082618970934014 |
|pencils |mechanical pencils 0.5|0.04730075285308284 |
|pencils |pencil box |0.04537707727651545 |
|pencils |pencil case |0.04408623373105654 |
|pencils |pensils 0.5 |0.042054450870644036 |
|pencils |crayons |0.03968021119078101 |
|pencils |cool pencils |0.03952088726420226 |
|pencils |pen |0.037752562111173546 |
|pencils |cool pensils |0.037510831184747497 |
|pencils |glue sticks |0.032787653384488386 |
|pencils |drawing pencils |0.032398405257143804 |
|pencils |cute pensils |0.031057982991620214 |
|pencils |cool penciles |0.02964868546657146 |
|pencils |art pencils |0.027646328736291175 |
|pencils |index cards |0.02744109743018355 |
|pencils |folders |0.027070809949858353 |
|pencils |pensil cases |0.0245633981485688 |
|pencils |golf pencils |0.02411058428627058 |
|pencils |notebooks |0.023534237276835582 |
|pencils |binder |0.02262728927378412 |
|pencils |pens bulk |0.021022196010267332 |
|pencils |folders with pockets |0.020883099832185503 |
|pencils |amazon basics pens |0.02010866875890696 |
|pencils |notebook |0.02002518175509854 |
|pencils |pencil grips |0.01712161261500511 |
|pencils |scissors |0.014319987175720715 |
|pencils |paper |0.013844962316376306 |
|pencils |tissues |0.012853953323240916 |
|pencils |sketch book |0.011383006178195793 |
|pencils |colored pens |0.010608837622836953 |
|pencils |printer paper |0.007159662743211316 |
|pencils |pencil holder |0.005219375924877335 |
|pencils |copy paper |0.0048398059676751275|
|pencils |paper clips |0.004323946307993657 |
|pencils |backpack |0.002456516357354455 |
|pencils |amazonbasics |0.002386583749856832 |
|pencils |desk organizer |0.0014871301773208652|
|pencils |fun penciles |7.114920197678272E-4 |
|pencils |pensils crayola |5.863935694892475E-4 |
|pencils |paper towels |4.956030568975998E-4 |
|pencils |pensils usa |3.074806676441355E-4 |
|pencils |golf pensils |1.6343203531543173E-4|
|pencils |carpenter penciles |1.537148743666664E-4 |
|pencils |usb extension cable |9.276356621610018E-5 |
+--------+----------------------+---------------------+
When I try:
spark.sql("select percentile_approx(similarity_score, 0.3) as threshold from Test").show()
+--------------------+
| threshold|
+--------------------+
|0.014319987175720715|
+--------------------+
The problem is that this works for normally distributed data, but as the sample above shows my data is skewed, and 0.01 is not a good threshold here. Is there a way to find a threshold in the presence of outliers like these?
You may choose to eliminate the outliers before running the percentile (but keep in mind that PERCENTILE as a function is itself a good way of dealing with outliers). In any case, to trim the tails of the distribution you can apply some filtering operations.
First, create a Window partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lit

val byGroup = Window.partitionBy(lit(1))
Then, cut anything beyond 1.96 standard deviations from the mean:
import org.apache.spark.sql.functions.{avg, lit, stddev_pop}
import spark.implicits._

val results = spark.sql("""SELECT value FROM table""")
  .withColumn("mn", avg($"value") over byGroup)
  .withColumn("sd", stddev_pop($"value") over byGroup) // 1.96 multiplies the standard deviation, not the variance
  .filter($"value" <= $"mn" + lit(1.96) * $"sd")
  .filter($"value" >= $"mn" - lit(1.96) * $"sd")
For a normal distribution this trims about 2.5% from each tail (mean ± 1.96 standard deviations covers 95% of the data); on skewed data like yours the cut falls almost entirely on the long tail. For details see here: https://en.wikipedia.org/wiki/1.96
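To see concretely what the mean ± 1.96·σ cut does on skewed data, here is a plain-Python sketch (not Spark, but the arithmetic is identical) applied to the 61 similarity_score values from the table above:

```python
import statistics

# The 61 similarity_score values from the table above, in order.
scores = [
    0.982669767327994, 0.8875609147113148, 0.5536876549778099,
    0.39102614317876977, 0.3837525124443511, 0.36800207529412093,
    0.20423055289450207, 0.08241323295822053, 0.08101804016695552,
    0.08091683972455012, 0.07299289422964994, 0.07113136587149338,
    0.0684130629091813, 0.06350472916694737, 0.058316002229988215,
    0.05552992878175486, 0.0491689934641725, 0.048082618970934014,
    0.04730075285308284, 0.04537707727651545, 0.04408623373105654,
    0.042054450870644036, 0.03968021119078101, 0.03952088726420226,
    0.037752562111173546, 0.037510831184747497, 0.032787653384488386,
    0.032398405257143804, 0.031057982991620214, 0.02964868546657146,
    0.027646328736291175, 0.02744109743018355, 0.027070809949858353,
    0.0245633981485688, 0.02411058428627058, 0.023534237276835582,
    0.02262728927378412, 0.021022196010267332, 0.020883099832185503,
    0.02010866875890696, 0.02002518175509854, 0.01712161261500511,
    0.014319987175720715, 0.013844962316376306, 0.012853953323240916,
    0.011383006178195793, 0.010608837622836953, 0.007159662743211316,
    0.005219375924877335, 0.0048398059676751275, 0.004323946307993657,
    0.002456516357354455, 0.002386583749856832, 0.0014871301773208652,
    7.114920197678272e-4, 5.863935694892475e-4, 4.956030568975998e-4,
    3.074806676441355e-4, 1.6343203531543173e-4, 1.537148743666664e-4,
    9.276356621610018e-5,
]

mean = statistics.fmean(scores)   # population mean, ~0.0874
sd = statistics.pstdev(scores)    # population standard deviation, ~0.188
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd

kept = [s for s in scores if lower <= s <= upper]
```

Because the distribution is right-skewed, the lower bound here is negative and only the long upper tail gets trimmed: the top three scores (0.98, 0.89, 0.55) fall above the upper bound, and everything from 0.391 down survives. So a symmetric rule acts mostly on the high side for this data, and you may still want a domain-driven lower threshold afterwards.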