How to concatenate two columns of spark dataframe with null values but get one value
I have two columns in my spark dataframe:
Name_ls Name_mg
Herry null
null Cong
Duck Duck77
Tinh Tin_Lee
Huong null
null Ngon
Lee null
My requirement is to add a new column to the dataframe by combining the above 2 columns,
where the value of the new column is whichever of the two old column values is not null.
How to do that in pyspark?
Expected output:
Name_ls Name_mg Name
Herry null Herry
null Cong Cong
Duck Duck77 Duck
Tinh Tin_Lee Tinh
Huong null Huong
null Ngon Ngon
Lee null Lee
You can use a when-otherwise statement:
when-otherwise – first check whether name_mg
is Null; if it is, take name_ls.
Otherwise, if name_ls is Not Null, take name_ls;
else fall back to name_mg.
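The order of those checks can be sketched as a plain-Python function (a hypothetical helper, just to illustrate the branching, not Spark code):

```python
def pick_name(name_ls, name_mg):
    """Mirror the when/when/otherwise chain used in the answer below."""
    if name_mg is None:           # when name_mg is null -> take name_ls
        return name_ls
    elif name_ls is not None:     # when name_ls is not null -> take name_ls
        return name_ls
    else:                         # otherwise -> fall back to name_mg
        return name_mg

print(pick_name('Herry', None))   # Herry
print(pick_name(None, 'Cong'))    # Cong
print(pick_name('Duck', 'Duck77'))  # Duck
```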
Data Preparation
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

input_str = """
Herry null
null Cong
Duck Duck77
Tinh Tin_Lee
Huong null
null Ngon
Lee null
""".split()

# Treat the literal string 'null' as a real None
input_values = [x if x != 'null' else None for x in input_str]
cols = "name_ls,name_mg".split(',')

n = len(input_values)
n_cols = 2
input_list = [tuple(input_values[i:i + n_cols]) for i in range(0, n, n_cols)]

sparkDF = spark.createDataFrame(input_list, cols)
sparkDF.show()
sparkDF.show()
+-------+-------+
|name_ls|name_mg|
+-------+-------+
| Herry| null|
| null| Cong|
| Duck| Duck77|
| Tinh|Tin_Lee|
| Huong| null|
| null| Ngon|
| Lee| null|
+-------+-------+
When-Otherwise
sparkDF = sparkDF.withColumn(
    'name',
    F.when(F.col('name_mg').isNull(), F.col('name_ls'))
     .when(F.col('name_ls').isNotNull(), F.col('name_ls'))
     .otherwise(F.col('name_mg'))
)
sparkDF.show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
| Herry| null|Herry|
| null| Cong| Cong|
| Duck| Duck77| Duck|
| Tinh|Tin_Lee| Tinh|
| Huong| null|Huong|
| null| Ngon| Ngon|
| Lee| null| Lee|
+-------+-------+-----+
You can use coalesce
from pyspark.sql.functions, which returns the first non-null value among its arguments:
from pyspark.sql import functions as f
df.withColumn("name",f.coalesce("name_ls","name_mg")).show()
+-------+-------+-----+
|name_ls|name_mg| name|
+-------+-------+-----+
| Herry| null|Herry|
| null| Cong| Cong|
| Duck| Duck77| Duck|
| Tinh|Tin_Lee| Tinh|
| Huong| null|Huong|
| null| Ngon| Ngon|
| Lee| null| Lee|
+-------+-------+-----+
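coalesce generalizes to any number of columns, returning the first non-null value in argument order. Its semantics can be sketched in plain Python (a hypothetical helper, not Spark's implementation):

```python
def coalesce(*values):
    """Return the first argument that is not None, else None."""
    for v in values:
        if v is not None:
            return v
    return None

print(coalesce(None, 'Cong'))      # Cong
print(coalesce('Duck', 'Duck77'))  # Duck
```

With F.coalesce you can also append F.lit('unknown') as a final argument to supply a default when every column is null.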
If, like me, you prefer Spark SQL statements, consider the NVL function:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
('Herry', None),
(None, 'Cong'),
('Duck', 'Duck77'),
('Tinh', 'Tin_Lee'),
('Huong', None),
(None, 'Ngon'),
('Lee', None)
]
schema = ['Name_ls', 'Name_mg']
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView('tmp')
res_sql = """
select Name_ls,Name_mg,nvl(Name_ls, Name_mg) Name
from tmp
"""
res_df = spark.sql(res_sql)
res_df.show(truncate=False)