How do I filter a PySpark DataFrame in a loop and append to a DataFrame?
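I have a function that filters a PySpark DataFrame by column value. I want to run it in a loop over different values and append each iteration's output to a single DataFrame. My current code overwrites the DataFrame on each iteration. How do I make it append instead of overwrite?
Here is my PySpark DataFrame (df):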
+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|226575        |2018-04-04 17:41:23|12 months of global news|
|227729        |2018-04-19 16:50:09|2 months of global news |
|228544        |2018-04-28 17:01:16|18 months of global news|
|231795        |2018-06-11 20:27:48|36 months of global news|
|234206        |2018-07-19 00:52:10|12 months of global news|
|234607        |2018-07-23 20:41:47|12 months of global news|
|235133        |2018-07-30 02:34:58|12 months of global news|
|237883        |2018-08-07 18:52:53|1 months of global news |
|237924        |2018-08-08 01:31:13|6 months of global news |
|238892        |2018-08-14 02:45:51|9 months of global news |
|242200        |2018-08-19 21:22:05|3 months of global news |
|249034        |2018-10-11 15:01:06|16 months of global news|
|254415        |2018-12-28 12:13:18|16 months of global news|
|257317        |2019-02-09 18:49:12|11 months of global news|
+--------------+-------------------+------------------------+
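Here is my function to select a product, e.g. "12 months of global news":
def renewal_filter(df, n):
    # Build the exact product label, e.g. '12 months of global news'
    prod_type = str(n) + ' months of global news'
    df_first_xmo = df.filter(df.product == prod_type)
    return df_first_xmo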
If I call the function in a loop, it overwrites the DataFrame on each iteration.
month = [12, 2]
for x in month:
    renewal_filter(df, x)
+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|226575        |2018-04-04 17:41:23|12 months of global news|
|234206        |2018-07-19 00:52:10|12 months of global news|
|234607        |2018-07-23 20:41:47|12 months of global news|
|235133        |2018-07-30 02:34:58|12 months of global news|
+--------------+-------------------+------------------------+
+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|227729        |2018-04-19 16:50:09|2 months of global news |
+--------------+-------------------+------------------------+
How do I change the loop logic to append rather than overwrite the DataFrame on each iteration, so that I get this result?
+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|226575        |2018-04-04 17:41:23|12 months of global news|
|234206        |2018-07-19 00:52:10|12 months of global news|
|234607        |2018-07-23 20:41:47|12 months of global news|
|235133        |2018-07-30 02:34:58|12 months of global news|
|227729        |2018-04-19 16:50:09|2 months of global news |
+--------------+-------------------+------------------------+
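Here is a completely different approach that does not require unioning DataFrames.
import pyspark.sql.functions as f

month = [12, 2]

# Take the leading number from product, then keep rows whose number is in month.
df.withColumn('month', f.split('product', ' ')[0]) \
  .filter(f.col('month').isin(month)) \
  .show(10, False)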
+-------+-------------------+------------------------+-----+
|user_id|purchase_date_all  |product                 |month|
+-------+-------------------+------------------------+-----+
|226575 |2018-04-04 17:41:23|12 months of global news|12   |
|227729 |2018-04-19 16:50:09|2 months of global news |2    |
|234206 |2018-07-19 00:52:10|12 months of global news|12   |
|234607 |2018-07-23 20:41:47|12 months of global news|12   |
|235133 |2018-07-30 02:34:58|12 months of global news|12   |
+-------+-------------------+------------------------+-----+
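If you still want the loop-and-append form asked about in the question, one possible sketch is to collect the filtered DataFrames in a list and fold them together with `union`. The helper name `renewal_union` is made up for this example, and `renewal_filter` is repeated from the question so the snippet stands on its own. It assumes `df` has a `product` column as shown above.

```python
from functools import reduce


def renewal_filter(df, n):
    """Return the rows of df whose product is '<n> months of global news'."""
    prod_type = str(n) + ' months of global news'
    return df.filter(df.product == prod_type)


def renewal_union(df, months):
    """Filter df once per value in months and stack the results in list order.

    renewal_union is a hypothetical helper name, not part of PySpark.
    """
    parts = [renewal_filter(df, n) for n in months]
    if not parts:
        raise ValueError("months must contain at least one value")
    # DataFrame.union keeps duplicate rows and matches columns by position,
    # which is safe here because every part comes from the same df.
    return reduce(lambda left, right: left.union(right), parts)
```

Calling `renewal_union(df, [12, 2]).show(10, False)` should list the 12-month rows followed by the 2-month row, as in the desired result, although Spark does not strictly guarantee row order without an explicit sort. Each loop value adds another filter and union to the query plan, so for a long list of values the single `isin` filter is likely the simpler choice.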