How do I generate columns from the unique values of a particular column in PySpark?

Question

I have the following DataFrame:

+----------+------------+---------------------+
|CustomerNo|size        |total_items_purchased|
+----------+------------+---------------------+
|  208261.0|          A |                    2|
|  208263.0|          C |                    1|
|  208261.0|          E |                    1|
|  208262.0|          B |                    2|
|  208264.0|          D |                    3|
+----------+------------+---------------------+

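I have another table, df, that contains only the customer numbers. I need to create one column for each unique comfortStyles value and fill those columns in df with total_items_purchased.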

My df table should look like this:

CustomerNo  size_A  size_B  size_C  size_D  size_E
208261.0    1       0       0       0       1
208262.0    0       2       0       0       0
208263.0    0       0       1       0       0
208264.0    0       0       0       3       0

Can anyone tell me how to do this?

Answer 1

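You can use the pivot function to reshape the table: group by CustomerNo so that each customer gets one row, then pivot on size so that each distinct size becomes its own column.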

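from pyspark.sql import functions as F

# One row per customer, one column per distinct size; missing combinations become 0
df = (df.groupBy('CustomerNo')
      .pivot('size')
      .agg(F.first('total_items_purchased'))
      .na.fill(0))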

pyspark