PySpark: replace a value in several columns at once
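I want to replace a value in a dataframe column with another value, and I have to do this for many columns (say, 30 out of 100 columns).
I have already gone through related posts on this.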
from pyspark.sql.functions import when, lit, col
df = sc.parallelize([(1, "foo", "val"), (2, "bar", "baz"), (3, "baz", "buz")]).toDF(["x", "y", "z"])
df.show()
# I can replace "baz" with null separately in columns y and z
def replace(column, value):
    # Keep the cell when it differs from `value`; otherwise emit null
    return when(column != value, column).otherwise(lit(None))
df = df.withColumn("y", replace(col("y"), "baz")) \
       .withColumn("z", replace(col("z"), "baz"))
df.show()
I can replace "baz" with null in columns y and z separately. But I want to do this for all columns, something like the list comprehension below:
[replace(df[c], "baz") for c in df.columns]
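Use the reduce() function:
from functools import reduce
# No initializer is passed, so df (the first list element) seeds the fold
# and each remaining element is a column name to rewrite.
reduce(lambda d, c: d.withColumn(c, replace(col(c), "baz")), [df, 'y', 'z']).show()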
#+---+----+----+
#| x| y| z|
#+---+----+----+
#| 1| foo| val|
#| 2| bar|null|
#| 3|null| buz|
#+---+----+----+
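The `[df, 'y', 'z']` argument above can look odd, so here is a minimal pure-Python sketch of the same fold. It assumes a list of row dicts as a stand-in for the DataFrame, and `null_out` is a hypothetical helper playing the role of `withColumn` plus `replace`; neither is part of PySpark.

```python
from functools import reduce

# Hypothetical stand-in for a DataFrame: a list of row dicts.
rows = [
    {"x": 1, "y": "foo", "z": "val"},
    {"x": 2, "y": "bar", "z": "baz"},
    {"x": 3, "y": "baz", "z": "buz"},
]

def null_out(data, column, value):
    """Return new rows with `value` replaced by None in `column`."""
    return [
        {**row, column: None if row[column] == value else row[column]}
        for row in data
    ]

# Without an initializer, reduce() uses the first list element as the accumulator.
implicit_seed = reduce(lambda d, c: null_out(d, c, "baz"), [rows, "y", "z"])

# Passing the seed as the third argument gives the same result and keeps
# the iterable a plain list of column names.
explicit_seed = reduce(lambda d, c: null_out(d, c, "baz"), ["y", "z"], rows)
```

In the PySpark version the equivalent explicit form would be `reduce(f, ['y', 'z'], df)`, which may read more clearly when the column list is built programmatically.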
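You can use select with a list comprehension:
import pyspark.sql.functions as f
# Replace "baz" in every column except x, in a single projection
df = df.select([replace(f.col(column), 'baz').alias(column) if column != 'x' else f.col(column)
                for column in df.columns])
df.show()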
Since around 30 of 100 columns are involved, let's add a few more columns to the DataFrame to generalize this better.
# Loading the requisite packages
from pyspark.sql.functions import col, when
df = sc.parallelize([(1, "foo", "val", "baz", "gun", "can", "baz", "buz", "oof"),
                     (2, "bar", "baz", "baz", "baz", "got", "pet", "stu", "got"),
                     (3, "baz", "buz", "pun", "iam", "you", "omg", "sic", "baz")]).toDF(["x", "y", "z", "a", "b", "c", "d", "e", "f"])
df.show()
+---+---+---+---+---+---+---+---+---+
| x| y| z| a| b| c| d| e| f|
+---+---+---+---+---+---+---+---+---+
| 1|foo|val|baz|gun|can|baz|buz|oof|
| 2|bar|baz|baz|baz|got|pet|stu|got|
| 3|baz|buz|pun|iam|you|omg|sic|baz|
+---+---+---+---+---+---+---+---+---+
Suppose we want to replace baz with Null in all columns except x and a. Use a list comprehension to select the columns where the replacement has to be done.
# This contains the list of columns where we apply replace() function
all_column_names = df.columns
print(all_column_names)
['x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f']
columns_to_remove = ['x','a']
columns_for_replacement = [i for i in all_column_names if i not in columns_to_remove]
print(columns_for_replacement)
['y', 'z', 'b', 'c', 'd', 'e', 'f']
Finally, do the replacement with when(), which is effectively another name for an if clause.
# Doing the replacement on all the requisite columns
for i in columns_for_replacement:
    df = df.withColumn(i, when(col(i) == 'baz', None).otherwise(col(i)))
df.show()
+---+----+----+---+----+---+----+---+----+
| x| y| z| a| b| c| d| e| f|
+---+----+----+---+----+---+----+---+----+
| 1| foo| val|baz| gun|can|null|buz| oof|
| 2| bar|null|baz|null|got| pet|stu| got|
| 3|null| buz|pun| iam|you| omg|sic|null|
+---+----+----+---+----+---+----+---+----+
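If the replacement can be done with a plain if-else clause, there is no need to create a UDF and define a function for it. UDFs are generally costly operations and should be avoided whenever possible.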