How to replace multiple commas with a single comma and count the words in each row of a PySpark DataFrame?
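I have a huge dataset in which every row contains several titles separated by commas (,). I want to do two things:
1- Remove commas that directly follow each other, keeping only one.
2- Count the words (the items) between the commas.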
For example, consider the following two rows:
column
hello, I am wondering/low,,, Going/hi, towards,, Host
winter, summer,,
Expected output:
column                                              count
hello, I am wondering/low, Going/hi, towards, Host  5
winter, summer,                                     2
1- remove , if they are followed by each other.
Use the regexp_replace function to replace multiple commas with a single comma. The regular expression ,{2,} matches a run of 2 or more consecutive commas.
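The pattern can be tried out without a Spark session. The following is a minimal sketch in plain Python, assuming that Python's re module treats ,{2,} the same way as the Java regex engine behind regexp_replace (which should hold for a simple quantifier like this one). collapse_commas is a hypothetical helper name used only for illustration; it is not part of PySpark.

```python
import re


def collapse_commas(text: str) -> str:
    """Replace every run of two or more commas with a single comma."""
    return re.sub(r",{2,}", ",", text)


# The two sample rows from the question
rows = [
    "hello, I am wondering/low,,, Going/hi, towards,, Host",
    "winter, summer,,",
]

cleaned = [collapse_commas(row) for row in rows]
for row in cleaned:
    print(row)
# hello, I am wondering/low, Going/hi, towards, Host
# winter, summer,
```

Commas separated by a space (", ,") are not consecutive, so this pattern leaves them untouched.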
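2- count words between ,.
As pointed out in the other linked question, you simply need to split the value and take the size of the resulting array. Here, however, a value can end with a trailing comma, so size would be larger than the actual number of words. To handle that, first filter the array to remove the empty strings.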
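from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, expr

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

data = [
    ("hello, I am wondering / low,,, Going / hi, towards,, Host",),
    ("winter, summer,,",)
]
df = spark.createDataFrame(data, ["column"])

# Step 1: collapse every run of 2+ commas into a single comma
# Step 2: split on commas, drop empty strings, and count what is left
df1 = df.withColumn("column", regexp_replace("column", ",{2,}", ",")) \
    .withColumn("count",
        expr("size(filter(split(column, ','), x -> nullif(x, '') is not null))")
    )

df1.show(truncate=False)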
#+------------------------------------------------------+-----+
#|column                                                |count|
#+------------------------------------------------------+-----+
#|hello, I am wondering / low, Going / hi, towards, Host|5    |
#|winter, summer,                                       |2    |
#+------------------------------------------------------+-----+
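To see why the filter step is needed, the split / filter / size chain can be mirrored in plain Python. This is only an illustrative sketch: count_words is a hypothetical helper, and str.split stands in here for Spark's split on a literal comma.

```python
def count_words(text: str) -> int:
    """Count the non-empty pieces obtained by splitting on commas."""
    pieces = text.split(",")                     # like split(column, ',')
    non_empty = [p for p in pieces if p != ""]   # like filter(..., x -> x is not empty)
    return len(non_empty)                        # like size(...)


row = "winter, summer,"

# Without filtering, the trailing comma produces an extra empty string
naive = len(row.split(","))
# With filtering, only the real items are counted
filtered = count_words(row)

print(naive, filtered)
# 3 2
```

In this sketch a piece that contains only spaces still counts as a word, because only exactly-empty strings are dropped. Trimming each piece first would be an additional step that the answer above does not take.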