PySpark - Remove whitespace in n-grams
I am trying to generate 3-character n-grams, but Spark's NGram inserts a whitespace between each character. I want to remove (or not produce) this whitespace. I could explode the array, remove the whitespace, and reassemble the array, but that is a very expensive operation (a sketch of that workaround follows the desired output below). Preferably, I would also like to avoid creating a UDF, given the performance issues of PySpark UDFs. Is there a cheaper way using PySpark built-in functions?
from pyspark.ml import Pipeline, Model, PipelineModel
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, NGram
from pyspark.sql.functions import *
wordDataFrame = spark.createDataFrame([
    (0, "Hello I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat")
], ["id", "words"])
pipeline = Pipeline(stages=[
    # pattern="" makes RegexTokenizer emit one token per character
    RegexTokenizer(pattern="", inputCol="words", outputCol="tokens", minTokenLength=1),
    NGram(n=3, inputCol="tokens", outputCol="ngrams")
])
model = pipeline.fit(wordDataFrame).transform(wordDataFrame)
model.show()
The current output is:
+---+--------------------+--------------------+--------------------+
| id| words| tokens| ngrams|
+---+--------------------+--------------------+--------------------+
| 0|Hello I heard abo...|[h, e, l, l, o, ...|[h e l, e l l, ...|
+---+--------------------+--------------------+--------------------+
But what I want is:
+---+--------------------+--------------------+--------------------+
| id| words| tokens| ngrams|
+---+--------------------+--------------------+--------------------+
| 0|Hello I heard abo...|[h, e, l, l, o, ...|[hel, ell, llo, ...|
+---+--------------------+--------------------+--------------------+
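For context, the explode/clean/re-collect workaround mentioned above would look roughly like the sketch below. It works on the model DataFrame produced by the pipeline; collect_list gives no ordering guarantee, so the positions from posexplode have to be carried along and sorted, which is part of what makes this route costly.
from pyspark.sql import functions as F
# Sketch of the expensive route: explode each n-gram, strip the spaces,
# then shuffle everything back together per id
exploded = model.select("id", F.posexplode("ngrams").alias("pos", "ngram"))
cleaned = exploded.withColumn("ngram", F.regexp_replace("ngram", " ", ""))
reassembled = (cleaned.groupBy("id")
    .agg(F.sort_array(F.collect_list(F.struct("pos", "ngram"))).alias("pairs"))
    .select("id", F.col("pairs.ngram").alias("ngrams")))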
You can use the higher-order function transform together with a regex (Spark 2.4+), assuming the ngrams column is of ArrayType containing StringType.
# sample dataframe
df.show()
+---+----------------+---------------+--------------+
| id| words| tokens| ngrams|
+---+----------------+---------------+--------------+
| 0|Hi I heard about|[h, e, l, l, o]|[h e l, e l l]|
+---+----------------+---------------+--------------+
from pyspark.sql import functions as F
# transform() rewrites each element of the ngrams array in place, using
# regexp_replace to strip the spaces without exploding or using a UDF
df.withColumn("ngrams", F.expr("""transform(ngrams, x -> regexp_replace(x, "\ ", ""))""")).show()
+---+----------------+---------------+----------+
| id| words| tokens| ngrams|
+---+----------------+---------------+----------+
| 0|Hi I heard about|[h, e, l, l, o]|[hel, ell]|
+---+----------------+---------------+----------+
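On Spark 3.1+, the same element-wise rewrite can also be written with the DataFrame-API transform function instead of an SQL expression string (a minimal, equivalent sketch):
from pyspark.sql import functions as F
# Spark 3.1+: functions.transform takes a Python lambda over Column objects,
# so the per-element regexp_replace no longer needs F.expr
df.withColumn("ngrams", F.transform("ngrams", lambda x: F.regexp_replace(x, " ", ""))).show()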