Conditional logic in PySpark with when

I have the following DataFrame:

customer_id           person_id          type_person  type_person2  insert_date2  anterior_type  update_date
abcdefghijklmnopqrst  4a5ae8a5-6682-467  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  1be8d3e8-8075-438  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  6912dadc-1692-4bd  Online       Offline       2022-03-02    Online         2022-03-03
abcdefghijklmnopqrst  e48cba37-113c-4bd  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  831cb669-b2ae-4e8  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  69161fe5-62ac-400  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  b48b59a0-92eb-410  Online       Online        2022-03-02    null           null

I need to look at the "type_person" and "type_person2" columns and create a new column using the following rules:

How can I do this?

Use a CASE WHEN statement.

You have two options.

  1. Use Spark SQL.
  2. Use DataFrame operations (reference: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html?highlight=when#pyspark.sql.functions.when).
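
For the first option, the rule can be written as a single CASE WHEN expression. The sketch below is only an illustration: it runs the expression against an in-memory SQLite table from the Python standard library as a stand-in for Spark, so it can be tried without a cluster. The table name `people`, the helper `classify_rows`, the capitalized literals, and the Offline/Offline branch are assumptions, since the rule list itself is not shown in the question.

```python
import sqlite3

# CASE WHEN expression for the rule. Literals are capitalized to match the
# sample data; the Offline/Offline branch and 'Hybrid' spelling are assumptions.
CASE_EXPR = """
CASE
    WHEN type_person = 'Online'  AND type_person2 = 'Online'  THEN 'online'
    WHEN type_person = 'Offline' AND type_person2 = 'Offline' THEN 'offline'
    WHEN type_person = 'Offline' AND type_person2 = 'Online'  THEN 'hybrid'
    WHEN type_person = 'Online'  AND type_person2 = 'Offline' THEN 'hybrid'
    WHEN type_person = 'Hybrid'  OR  type_person2 = 'Hybrid'  THEN 'hybrid'
    ELSE NULL
END
"""


def classify_rows(rows):
    """Evaluate CASE_EXPR for (type_person, type_person2) pairs in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (type_person TEXT, type_person2 TEXT)")
        conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
        cursor = conn.execute(
            f"SELECT {CASE_EXPR} AS rule_result FROM people ORDER BY rowid"
        )
        return [row[0] for row in cursor]
    finally:
        conn.close()


sample = [("Online", "Online"), ("Online", "Offline"), (None, None)]
print(classify_rows(sample))  # ['online', 'hybrid', None]
```

In PySpark, the same expression text could presumably be passed to `pyspark.sql.functions.expr`, for example `DF.withColumn('rule_result', F.expr(CASE_EXPR))`, or used in a query via `spark.sql` after registering the DataFrame as a temporary view.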

Let's use the second approach:

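import pyspark.sql.functions as F

# Spark string comparison is case-sensitive, so the literals match the data
# ('Online', 'Offline'; the spelling 'Hybrid' is assumed by analogy).
# Each comparison is parenthesized because & and | bind more tightly
# than == in Python.
(
    DF
    .withColumn(
        'rule_result',
        F.when((F.col("type_person") == 'Online') & (F.col("type_person2") == 'Online'), 'online')
        # Assumption: Offline/Offline gives 'offline'. The original repeated the
        # Offline/Online test here, which made the next branch unreachable.
        .when((F.col("type_person") == 'Offline') & (F.col("type_person2") == 'Offline'), 'offline')
        .when((F.col("type_person") == 'Offline') & (F.col("type_person2") == 'Online'), 'hybrid')
        .when((F.col("type_person") == 'Online') & (F.col("type_person2") == 'Offline'), 'hybrid')
        .when((F.col("type_person") == 'Hybrid') | (F.col("type_person2") == 'Hybrid'), 'hybrid')
        .otherwise(None)  # no rule matched: the new column stays null
    )
)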
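
One thing to watch with `when` conditions: in Python, `&` and `|` bind more tightly than `==`, so each comparison in a combined condition needs its own parentheses. The small sketch below uses only the standard `ast` module to show how Python parses the two forms; the names `a` and `b` are placeholders standing in for column expressions.

```python
import ast

# Parse both spellings of the condition without evaluating them.
unparenthesized = ast.parse("a == 'x' & b == 'y'", mode="eval").body
parenthesized = ast.parse("(a == 'x') & (b == 'y')", mode="eval").body

# Without parentheses Python sees one chained comparison:
#   a == ('x' & b) == 'y'
print(type(unparenthesized).__name__)                 # Compare
print(type(unparenthesized.comparators[0]).__name__)  # BinOp, i.e. ('x' & b)

# With parentheses it is a bitwise AND of two separate comparisons,
# which is the form a combined column condition needs.
print(type(parenthesized).__name__)       # BinOp
print(type(parenthesized.left).__name__)  # Compare
```

So a condition written as `F.col("type_person") == 'Online' & F.col("type_person2") == 'Online'` is not read as "both columns equal 'Online'", which is why the comparisons should be wrapped individually.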