How to replace any instances of an integer with NULL in a column meant for strings using PySpark?

Question

Note: this is for Spark version 2.1.1.2.6.1.0-129

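I have a Spark DataFrame. One of its columns holds state names as strings (e.g. Illinois, California, Nevada). The column also contains some numeric values (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with NULL.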

Below is some code I have written:

my_df = my_df.selectExpr(
        " regexp_replace(states, '^-?[0-9]+$', '') AS states ",
        "someOtherColumn")

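This regex replaces any instance of an integer with an empty string. I would like to replace it with None in Python instead, so that it is stored as a NULL value in the DataFrame.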

Answer 1

I strongly suggest you look at the PySpark SQL functions and try to use them properly instead of selectExpr:

from pyspark.sql import functions as F

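(df
    # Write the cleaned values to a new column so the result can be compared with the original;
    # a value that consists only of an integer becomes NULL
    .withColumn('states_fixed', F
        .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
        .otherwise(F.col('states'))
    )
    .show()
)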

# Output
# +----------+------------+
# |    states|states_fixed|
# +----------+------------+
# |  Illinois|    Illinois|
# |        12|        null|
# |California|  California|
# |        01|        null|
# |    Nevada|      Nevada|
# +----------+------------+
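
To check which values the pattern '^-?[0-9]+$' treats as integers, it can be tried locally with Python's re module before running it on a cluster. This is only a rough sketch: Spark evaluates the pattern with Java regular expressions, and the assumption here is that both engines agree on a pattern this simple. The helper name null_if_integer and the sample values are made up for illustration and are not part of the question or the answer.

```python
import re

# Same pattern as in the answer: an optional minus sign followed only by digits.
INTEGER_PATTERN = re.compile(r'^-?[0-9]+$')


def null_if_integer(value):
    """Return None for integer-like strings, otherwise the value unchanged.

    Hypothetical helper that mimics the intended column logic in plain Python.
    """
    if value is None:
        return None
    return None if INTEGER_PATTERN.match(value) else value


# Illustrative sample values (assumed, not taken from real data).
samples = ['Illinois', '12', 'California', '01', 'Nevada', '-7', 'Route 66', '']
cleaned = [null_if_integer(s) for s in samples]
print(cleaned)
# ['Illinois', None, 'California', None, 'Nevada', None, 'Route 66', '']
```

Note that the empty string does not match the pattern, so this helper leaves it alone, whereas the regexp_replace(...) == '' comparison in the answer would also turn an already-empty string into NULL. Whether that matters depends on the data.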

python

python-2.7

apache-spark

regexp-replace

pyspark