Spark - Single column to columns of X classes

Question

Currently, I have a DataFrame with a single column, as shown below:

 color
 -----
 green
 blue
 green
 red
 yellow
 red
 orange

And so on... (30 different colors).

From this column, I would like to produce a DataFrame like this:

green  blue  red  yellow  orange  purple  ... more colors
  1     0     0     0       0       0
  0     1     0     0       0       0
  1     0     0     0       0       0
  0     0     1     0       0       0
  0     0     0     1       0       0
  0     0     1     0       0       0
  0     0     0     0       1       0

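That is, a DataFrame in which every column is 0 except the one for the color found at the same row index in the original column.

So far I have tried different functions and solutions, none of which worked (and the code looks really messy). I would like to know whether there is an "easy" or simple way to do this, or whether I should use another library such as Pandas (I am using Python). If you know R, what I want is the table function.

Thanks.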

Answer 1

Something like this should do the trick:

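from pyspark.sql.functions import when, lit, col

# Collect the distinct colors to the driver. Going through .rdd is needed
# on Spark 2.0+, where DataFrame.map no longer exists.
colors = df.select("color").distinct().rdd.map(lambda x: x[0]).collect()
# One 0/1 indicator column per distinct color.
cols = [
    when(col("color") == lit(color), 1).otherwise(0).alias(color)
    for color in colors
]

df.select(*cols)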
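
To make the logic of the indicator columns concrete without a Spark session, here is a minimal plain-Python sketch of the same idea. The sample list is taken from the question, and the helper name `one_hot_rows` is a hypothetical name used only for illustration; it is not part of the answer's code.

```python
# Plain-Python illustration of what the when(...).otherwise(0) columns compute.
# `one_hot_rows` is a hypothetical helper, not a Spark API.

def one_hot_rows(values):
    """Return (levels, rows): for each value, one 0/1 entry per distinct level."""
    # Sorting fixes the column order here; Spark's distinct() gives no such guarantee.
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

colors = ["green", "blue", "green", "red", "yellow", "red", "orange"]
levels, rows = one_hot_rows(colors)

print(levels)
for row in rows:
    print(row)
```

Every row contains exactly one 1, in the column whose name matches the original value.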

If you are looking for another solution closer to R's table, you may want to take a look at crosstab and cube.
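
For a single column, R's table boils down to a count per distinct value. As a rough plain-Python picture of that result (the sample list is an assumption taken from the question; in Spark, `df.groupBy("color").count()` would be the closest single-column analogue):

```python
from collections import Counter

# Sample values from the question; a real column would come from the DataFrame.
colors = ["green", "blue", "green", "red", "yellow", "red", "orange"]

# One count per distinct value, which is what a one-way frequency table holds.
counts = Counter(colors)
for color, n in sorted(counts.items()):
    print(color, n)
```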

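Notes

When the number of levels is large, creating a dense DataFrame becomes rather inefficient. In that case you should consider using sparse vectors: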

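from pyspark.sql import Row
# pyspark.ml.linalg provides the counterpart used by the ml package on Spark 2.0+.
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import StringIndexer

def toVector(n):
    # Returns a function that turns a Row holding a color index
    # into a Row with a one-hot sparse vector of size n.
    def _toVector(row):
        return Row("vec")(Vectors.sparse(n, {int(row[0]): 1.0}))
    return _toVector

# Assign each distinct color a numeric index (0.0, 1.0, ...).
indexer = StringIndexer(inputCol="color", outputCol="colorIdx")
indexed = indexer.fit(df).transform(df)
n = indexed.select("colorIdx").distinct().count()

# Going through .rdd is needed on Spark 2.0+, where DataFrame.map no longer exists.
vectorized = indexed.select("colorIdx").rdd.map(toVector(n)).toDF()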
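
To see why the sparse form scales better, the following plain-Python sketch mimics the two steps above: index the labels, then store only the position of the single 1.0. The frequency-based indexing mirrors StringIndexer's default of giving the most frequent label index 0; the alphabetical tie-break and the names `index_labels`, `to_sparse`, and `to_dense` are assumptions made for this illustration, not Spark behavior or Spark APIs.

```python
from collections import Counter

def index_labels(values):
    """Map each label to an integer index, most frequent first.

    Ties are broken alphabetically here purely to keep the sketch deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def to_sparse(index, size):
    """Sparse one-hot: the vector size plus the single non-zero entry."""
    return (size, {index: 1.0})

def to_dense(sparse):
    """Expand a sparse one-hot into a full list, for comparison only."""
    size, entries = sparse
    return [entries.get(i, 0.0) for i in range(size)]

colors = ["green", "blue", "green", "red", "yellow", "red", "orange"]
mapping = index_labels(colors)
n = len(mapping)
vectors = [to_sparse(mapping[c], n) for c in colors]

print(mapping)
print(vectors[0], to_dense(vectors[0]))
```

Each sparse entry stores one index and one value no matter how many levels exist, whereas the dense row grows with the number of levels.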
