How replace multiple commas by single comma and count words in each line of a Pyspark DataFrame?

I have a huge dataset where every row contains some titles separated by ,. I want to do two things:

1- remove , if they are followed by each other.

2- count words between ,.

For example, consider the following two rows:

      column
hello, I am wondering/low,,, Going/hi, towards,, Host
winter, summer,,  

Expected output:

      column                                        count
hello, I am wondering/low, Going/hi, towards, Host    5
winter, summer,                                       2

1- remove , if they are followed by each other.

Use the regexp_replace function with a regular expression to replace multiple commas with a single one. The pattern ,{2,} matches 2 or more consecutive commas.
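The pattern can be sanity-checked with Python's re module before running it on the cluster (a plain-Python sketch, separate from the Spark job):

```python
import re

# ,{2,} matches a run of 2 or more commas; each run is replaced by one comma
collapsed = re.sub(r",{2,}", ",", "winter, summer,,")
print(collapsed)  # winter, summer,
```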

2- count words between ,.

As pointed out in the other linked question, you simply need to split the value and take the size of the resulting array. But here a value can end with a trailing comma, so size would be larger than the actual number of words. To handle that, you first have to filter the array to eliminate empty strings.
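The split-then-filter logic can be illustrated in plain Python (a sketch of the same idea, not the Spark expression itself):

```python
# After collapsing commas, "winter, summer," splits into ["winter", " summer", ""]
parts = "winter, summer,".split(",")

# Drop empty strings so the trailing comma does not inflate the count
words = [p for p in parts if p != ""]
print(len(words))  # 2
```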

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, expr

spark = SparkSession.builder.getOrCreate()

data = [
    ("hello, I am wondering / low,,, Going / hi, towards,, Host",),
    ("winter, summer,,",)
]

df = spark.createDataFrame(data, ["column"])

# 1- collapse runs of 2+ commas into a single comma
# 2- split on ',', drop empty strings, and take the size of what remains
df1 = df.withColumn("column", regexp_replace("column", ",{2,}", ",")) \
    .withColumn("count",
                expr("size(filter(split(column, ','), x -> nullif(x, '') is not null))")
                )

df1.show(truncate=False)

#+------------------------------------------------------+-----+
#|column                                                |count|
#+------------------------------------------------------+-----+
#|hello, I am wondering / low, Going / hi, towards, Host|5    |
#|winter, summer,                                       |2    |
#+------------------------------------------------------+-----+