How can I copy a column value from the first row per user_id to the second row for the same user_id?

Question

I have the following PySpark DataFrame:

+-------+---------------------+---+----------+--------------+
|user_id|             product | rn|Product_Mo|First_Purchase|
+-------+---------------------+---+----------+--------------+
| 246981|6 month subscription |  1|         6|          null|
| 246981|12 month subscription|  2|        12|          null|
| 249357|6 month subscription |  1|         6|          null|
| 249357|3 month subscription |  2|         3|          null|
| 243532|6 month subscription |  1|         6|          null|
| 243532|3 month subscription |  2|         3|          null|
| 257345|6 month subscription |  1|         6|          null|
| 257345|2 month subscription |  2|         2|          null|
| 256355|6 month subscription |  1|         6|          null|
| 256355|12 month subscription|  2|        12|          null|
| 246701|6 month subscription |  1|         6|          null|
| 246701|12 month subscription|  2|        12|          null|
| 254082|6 month subscription |  1|         6|          null|
| 254082|12 month subscription|  2|        12|          null|
| 239210|6 month subscription |  1|         6|          null|
| 239210|12 month subscription|  2|        12|          null|
| 247518|6 month subscription |  1|         6|          null|
| 247518|12 month subscription|  2|        12|          null|
+-------+---------------------+---+----------+--------------+

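I need to capture the value of Product_Mo where rn = 1 and copy it into First_Purchase for both the rn = 1 and rn = 2 rows. This will let me later group by First_Purchase and count all first and second purchases for users whose first purchase was a 6 month subscription.

The resulting DataFrame should look like this: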

+-------+---------------------+---+----------+--------------+
|user_id|             product | rn|Product_Mo|First_Purchase|
+-------+---------------------+---+----------+--------------+
| 246981|6 month subscription |  1|         6|             6|
| 246981|12 month subscription|  2|        12|             6|
| 249357|6 month subscription |  1|         6|             6|
| 249357|3 month subscription |  2|         3|             6|
| 243532|6 month subscription |  1|         6|             6|
| 243532|3 month subscription |  2|         3|             6|
| 257345|6 month subscription |  1|         6|             6|
| 257345|2 month subscription |  2|         2|             6|
| 256355|6 month subscription |  1|         6|             6|
| 256355|12 month subscription|  2|        12|             6|
| 246701|6 month subscription |  1|         6|             6|
| 246701|12 month subscription|  2|        12|             6|
| 254082|6 month subscription |  1|         6|             6|
| 254082|12 month subscription|  2|        12|             6|
| 239210|6 month subscription |  1|         6|             6|
| 239210|12 month subscription|  2|        12|             6|
| 247518|6 month subscription |  1|         6|             6|
| 247518|12 month subscription|  2|        12|             6|
+-------+---------------------+---+----------+--------------+

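I have not figured out how to capture the value of Product_Mo where rn = 1 and copy it into First_Purchase for both rn = 1 and rn = 2. The Product_Mo value at rn = 1 can change in later iterations, so I need to copy that value whatever it is. It will not always be 6.

I hope this makes sense. Any suggestions are appreciated.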

Answer 1

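Use the first function over a Window partitioned by user_id and ordered by rn. withColumn returns a new DataFrame, so assign the result back.

from pyspark.sql import Window
from pyspark.sql import functions as F

# One partition per user, ordered by purchase rank so rn = 1 comes first
w = Window.partitionBy('user_id').orderBy('rn')
# first() over the ordered window gives every row the user's rn = 1 value
df = df.withColumn('First_Purchase', F.first('Product_Mo').over(w))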

apache-spark

pyspark

pyspark-dataframes