Encode a PySpark column, creating another column of categorical (factor) codes
I have the following PySpark dataframe:
+----------------------+
| Paths |
+----------------------+
|[link1, link2, link3] |
|[link1, link2, link4] |
|[link1, link2, link3] |
|[link1, link2, link4] |
...
..
.
+----------------------+
I want to encode the paths as a categorical variable and add this information to the dataframe. The result should look like this:
+----------------------+----------------------+
| Paths | encodedPaths |
+----------------------+----------------------+
|[link1, link2, link3] | 1 |
|[link1, link2, link4] | 2 |
|[link1, link2, link3] | 1 |
|[link1, link2, link4] | 2 |
...
..
.
+----------------------+----------------------+
Looking around, I found this solution:
indexer = pathsDF.select("Paths").distinct().withColumn("encodedPaths", F.monotonically_increasing_id())
pathsDF = pathsDF.join(indexer, "Paths")
It should work, but the number of distinct paths differs between the original dataframe and the resulting one. On top of that, some values in the encoded column are far higher than the number of distinct paths. That shouldn't be possible, since the monotonically_increasing_id function is supposed to increase linearly.
Do you have another solution?
You can use StringIndexer from MLlib after converting the array column to a string:
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol="PathsStr", outputCol="encodedPaths")
df2 = df.withColumn("PathsStr",F.col("Paths").cast("string"))
#or df2 = df.withColumn("PathsStr",F.concat_ws(",","Paths"))
out = stringIndexer.fit(df2).transform(df2)\
.withColumn("encodedPaths",F.col("encodedPaths")+1)\
.select(*df.columns,"encodedPaths")
out.show(truncate=False)
+---------------------+------------+
|Paths |encodedPaths|
+---------------------+------------+
|[link1, link2, link3]|1.0 |
|[link1, link2, link4]|2.0 |
|[link1, link2, link3]|1.0 |
|[link1, link2, link4]|2.0 |
+---------------------+------------+