Replace a column value with NULL in PySpark

How can I replace incorrect column values (values containing characters such as * or #) with null?

Test dataset:

df = spark.createDataFrame(
    [(10, '2021-08-16 00:54:43+01', 0.15, 'SMS'),
     (11, '2021-08-16 00:04:29+01', 0.15, '*'),
     (12, '2021-08-16 00:39:05+01', 0.15, '***')],
    ['_c0', 'Timestamp', 'Amount','Channel']
)
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp             |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15  |SMS    |
# |11 |2021-08-16 00:04:29+01|0.15  |*      |
# |12 |2021-08-16 00:39:05+01|0.15  |***    |
# +---+----------------------+------+-------+

Script:

from pyspark.sql import functions as F

# Keep Channel only where it contains no '*' or '#' character; a `when`
# without an `otherwise` returns null for the non-matching rows.
df = df.withColumn('Channel', F.when(~F.col('Channel').rlike(r'[\*#]+'), F.col('Channel')))

df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp             |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15  |SMS    |
# |11 |2021-08-16 00:04:29+01|0.15  |null   |
# |12 |2021-08-16 00:39:05+01|0.15  |null   |
# +---+----------------------+------+-------+

So you have several options:

The first option is to use the when function, with one condition per character you want to screen out:

Example: when function
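
A minimal sketch of this approach, assuming '*' and '#' are the only characters to test for; it relies on the same behavior as the script above, where a `when` with no `otherwise` yields null:

from pyspark.sql import functions as F

# One condition per unwanted character: keep Channel only when it
# contains neither '*' nor '#'. Rows failing the test become null,
# since `when` without `otherwise` defaults to null.
df = df.withColumn(
    'Channel',
    F.when(
        ~F.col('Channel').contains('*') & ~F.col('Channel').contains('#'),
        F.col('Channel')
    )
)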

The second option is to use the replace function (DataFrame.replace).

Example: replace function
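
A sketch, assuming the invalid entries are exact, known strings. DataFrame.replace matches whole cell values rather than patterns, so every variant ('*', '***', and so on) must be listed explicitly; None is an accepted replacement value:

# Replace the exact values '*' and '***' in Channel with null.
# Note: replace matches entire cell values, not substrings or regexes.
df = df.replace(['*', '***'], None, subset=['Channel'])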

The third option is to use regexp_replace to remove the offending characters and then convert the emptied values to null (regexp_replace itself can only substitute one string for another, not null).

Example: regexp_replace function
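
A sketch of that two-step reading of the option (the split into strip-then-null is an assumption about the intended approach): first strip '*' and '#' with regexp_replace, then map the resulting empty strings to null with when:

from pyspark.sql import functions as F

# Step 1: strip every '*' and '#' character from Channel.
df = df.withColumn('Channel', F.regexp_replace('Channel', r'[*#]', ''))

# Step 2: turn the now-empty strings into null.
df = df.withColumn(
    'Channel',
    F.when(F.col('Channel') != '', F.col('Channel'))
)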