encode pyspark column creating another column of factorial values

I have the following pyspark dataframe:

+----------------------+
|        Paths         |
+----------------------+
|[link1, link2, link3] |               
|[link1, link2, link4] |          
|[link1, link2, link3] |              
|[link1, link2, link4] |
...
..
. 
+----------------------+

I want to encode the paths as a categorical variable and add this information to the dataframe. The result should look like this:

+----------------------+----------------------+
|        Paths         |      encodedPaths    |
+----------------------+----------------------+
|[link1, link2, link3] |          1           |     
|[link1, link2, link4] |          2           |
|[link1, link2, link3] |          1           |
|[link1, link2, link4] |          2           |
...
..
. 
+----------------------+----------------------+

Looking around, I found this solution:

import pyspark.sql.functions as F

indexer = pathsDF.select("Paths").distinct().withColumn("encodedPaths", F.monotonically_increasing_id())
pathsDF = pathsDF.join(indexer, "Paths")

It should work, but the number of distinct paths differs between the original dataframe and the resulting one. On top of that, some values in the encoded column are significantly higher than the number of distinct paths, which should not be possible since the monotonically_increasing function is supposed to increase linearly. Do you have any other solution?
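A rough sketch of a consecutive alternative (an assumption, not from the original post, reusing the pathsDF and Paths names from the snippet above): monotonically_increasing_id() packs the partition ID into the upper bits of each value, so the IDs are increasing but not consecutive, which is why numbers far above the distinct-path count can appear. Ranking the distinct paths instead yields consecutive codes:

from pyspark.sql import functions as F, Window

# Rank the distinct paths to get consecutive codes 1..N.
# Ordering by the stringified array keeps the window deterministic;
# a single-partition window is acceptable here because only the
# distinct paths (a small set) are ranked.
indexer = (pathsDF.select("Paths").distinct()
           .withColumn("encodedPaths",
                       F.dense_rank().over(Window.orderBy(F.col("Paths").cast("string")))))
pathsDF = pathsDF.join(indexer, "Paths")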

You can use StringIndexer from MLlib after converting the array column to a string:

from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

stringIndexer = StringIndexer(inputCol="PathsStr", outputCol="encodedPaths")

# StringIndexer cannot index an array column directly, so cast it to a string first
df2 = df.withColumn("PathsStr", F.col("Paths").cast("string"))
# or: df2 = df.withColumn("PathsStr", F.concat_ws(",", "Paths"))

# StringIndexer assigns labels starting at 0.0; shift by 1 so the codes start at 1
out = (stringIndexer.fit(df2).transform(df2)
       .withColumn("encodedPaths", F.col("encodedPaths") + 1)
       .select(*df.columns, "encodedPaths"))

out.show(truncate=False)
+---------------------+------------+
|Paths                |encodedPaths|
+---------------------+------------+
|[link1, link2, link3]|1.0         |
|[link1, link2, link4]|2.0         |
|[link1, link2, link3]|1.0         |
|[link1, link2, link4]|2.0         |
+---------------------+------------+
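
StringIndexer emits double labels, hence the 1.0 and 2.0 above. If plain integers are wanted, as in the desired output, a final cast should do (a small addition, not part of the original answer):

# Cast the double labels produced by StringIndexer to integers.
out = out.withColumn("encodedPaths", F.col("encodedPaths").cast("int"))
out.show(truncate=False)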