How to concatenate two columns of spark dataframe with null values but get one value
I have two columns in my spark dataframe:
Name_ls Name_mg
Herry null
null Cong
Duck Duck77
Tinh Tin_Lee
Huong null
null Ngon
Lee null
My requirement is to add a new column to the dataframe by combining the above 2 columns,
where the value of the new column is whichever of the two old column values is not null.
How to do that in pyspark?
Expected output:
Name_ls Name_mg Name
Herry null Herry
null Cong Cong
Duck Duck77 Duck
Tinh Tin_Lee Tinh
Huong null Huong
null Ngon Ngon
Lee null Lee
You can use a when-otherwise statement:
when-otherwise – first check whether name_mg
is Null; if it is, take name_ls.
Otherwise, if name_ls is Not Null, take name_ls;
else fall back to name_mg.
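The order of those checks can be sketched as a plain-Python function (a hypothetical helper, just to illustrate the branching, not Spark code):

```python
def pick_name(name_ls, name_mg):
    """Mirror the when/when/otherwise chain used in the answer below."""
    if name_mg is None:           # when name_mg is null -> take name_ls
        return name_ls
    elif name_ls is not None:     # when name_ls is not null -> take name_ls
        return name_ls
    else:                         # otherwise -> fall back to name_mg
        return name_mg

print(pick_name('Herry', None))   # Herry
print(pick_name(None, 'Cong'))    # Cong
print(pick_name('Duck', 'Duck77'))  # Duck
```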
Data Preparation
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

input_str = """
Herry null
null Cong
Duck Duck77
Tinh Tin_Lee
Huong null
null Ngon
Lee null
""".split()

# Treat the literal string 'null' as a real None
input_values = [x if x != 'null' else None for x in input_str]
cols = "name_ls,name_mg".split(',')

n = len(input_values)
n_cols = 2
input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = spark.createDataFrame(input_list, cols)
sparkDF.show()
sparkDF.show()
+-------+-------+
|name_ls|name_mg|
+-------+-------+
| Herry| null|
| null| Cong|
| Duck| Duck77|
| Tinh|Tin_Lee|
| Huong| null|
| null| Ngon|
| Lee| null|
+-------+-------+
When-Otherwise
sparkDF = sparkDF.withColumn(
    'name',
    F.when(F.col('name_mg').isNull(), F.col('name_ls'))
     .when(F.col('name_ls').isNotNull(), F.col('name_ls'))
     .otherwise(F.col('name_mg'))
)
sparkDF.show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
| Herry| null|Herry|
| null| Cong| Cong|
| Duck| Duck77| Duck|
| Tinh|Tin_Lee| Tinh|
| Huong| null|Huong|
| null| Ngon| Ngon|
| Lee| null| Lee|
+-------+-------+-----+
You can use coalesce
from pyspark.sql.functions, which returns the first non-null value among its arguments:
from pyspark.sql import functions as f
df.withColumn("name",f.coalesce("name_ls","name_mg")).show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
| Herry| null|Herry|
| null| Cong| Cong|
| Duck| Duck77| Duck|
| Tinh|Tin_Lee| Tinh|
| Huong| null|Huong|
| null| Ngon| Ngon|
| Lee| null| Lee|
+-------+-------+-----+
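coalesce generalizes to any number of columns, returning the first non-null value in argument order. Its semantics can be sketched in plain Python (a hypothetical helper, not Spark's implementation):

```python
def coalesce(*values):
    """Return the first argument that is not None, else None."""
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, 'Cong'))      # Cong
print(coalesce('Duck', 'Duck77'))  # Duck
```

With F.coalesce you can also append F.lit('unknown') as a final argument to supply a default when every column is null.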
If, like me, you prefer Spark SQL statements, consider the NVL function:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
('Herry', None),
(None, 'Cong'),
('Duck', 'Duck77'),
('Tinh', 'Tin_Lee'),
('Huong', None),
(None, 'Ngon'),
('Lee', None)
]
schema = ['Name_ls', 'Name_mg']
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView('tmp')
res_sql = """
select Name_ls,Name_mg,nvl(Name_ls, Name_mg) Name
from tmp
"""
res_df = spark.sql(res_sql)
res_df.show(truncate=False)