在 PySpark 中用 NULL 替换列值
Replace a column value with NULL in PySpark
如何用 null 替换不正确的列值(具有 *
或 #
等字符的值)?
测试数据集:
df = spark.createDataFrame(
[(10, '2021-08-16 00:54:43+01', 0.15, 'SMS'),
(11, '2021-08-16 00:04:29+01', 0.15, '*'),
(12, '2021-08-16 00:39:05+01', 0.15, '***')],
['_c0', 'Timestamp', 'Amount','Channel']
)
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |* |
# |12 |2021-08-16 00:39:05+01|0.15 |*** |
# +---+----------------------+------+-------+
脚本:
from pyspark.sql import functions as F
df = df.withColumn('Channel', F.when(~F.col('Channel').rlike(r'[\*#]+'), F.col('Channel')))
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |null |
# |12 |2021-08-16 00:39:05+01|0.15 |null |
# +---+----------------------+------+-------+
所以你有多个选择:
第一个选项是使用 when 函数为要替换的每个字符设置替换条件:
第二种选择是使用替换功能。
第三个选项是使用regex_replace将所有字符替换为空值
如何用 null 替换不正确的列值(具有 *
或 #
等字符的值)?
测试数据集:
df = spark.createDataFrame(
[(10, '2021-08-16 00:54:43+01', 0.15, 'SMS'),
(11, '2021-08-16 00:04:29+01', 0.15, '*'),
(12, '2021-08-16 00:39:05+01', 0.15, '***')],
['_c0', 'Timestamp', 'Amount','Channel']
)
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |* |
# |12 |2021-08-16 00:39:05+01|0.15 |*** |
# +---+----------------------+------+-------+
脚本:
from pyspark.sql import functions as F
df = df.withColumn('Channel', F.when(~F.col('Channel').rlike(r'[\*#]+'), F.col('Channel')))
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |null |
# |12 |2021-08-16 00:39:05+01|0.15 |null |
# +---+----------------------+------+-------+
所以你有多个选择:
第一个选项是使用 when 函数为要替换的每个字符设置替换条件:
第二种选择是使用替换功能。
第三个选项是使用regex_replace将所有字符替换为空值