Using a UDF in Spark data frame for text mining

Question

I have the following function:

def tokenize(text : String) : Array[String] = {
  // Lowercase each word and remove punctuation.
  // Backslashes must be doubled inside a regular Scala string literal.
  text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}

which needs to be applied to the column "title" in the data frame df_article. How can I do this in Spark using a UDF?

Sample data

+--------------------+
|               title|
+--------------------+
|A new relictual a...|
|A new relictual a...|
|A new relictual a...|
+--------------------+
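
For reference, here is a quick check of what the function returns on a plain string, outside Spark. This is only a sketch: the sample title and the object name TokenizeCheck are made up for illustration, since the titles in the table above are truncated.

```scala
object TokenizeCheck {
  // Same logic as the function in the question, with the regex escapes doubled.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  def main(args: Array[String]): Unit = {
    // Hypothetical sample title, not taken from the real data.
    val tokens = tokenize("A new relictual, highly-derived species!")
    println(tokens.mkString("[", ", ", "]"))
    // prints [a, new, relictual, highlyderived, species]
  }
}
```

Note that punctuation is deleted rather than treated as a separator, so "highly-derived" becomes the single token "highlyderived". A title with leading whitespace would also produce an empty first token, because split keeps a leading empty string.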

Answer 1

You can define your UDF like this:

import org.apache.spark.sql.functions.udf
val myToken = udf((xs: String) => xs.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+"))

and create a new data frame with an additional column:

df_article.withColumn("newTitle", myToken(df_article("title")))

Alternatively, you can register your tokenize function:

val tk = sqlContext.udf.register("tk", tokenize _)

and get the new data frame by applying it:

df_article.withColumn("newTitle", tk(df_article("title")))
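
Registering the function by name is mainly useful when you also want to call it from SQL. A minimal sketch, assuming Spark 2.x or later with a local SparkSession (on 1.x, sqlContext plays the same role); the one-row df_article, the view name articles and the object name TokenizeUdfSketch are assumptions made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object TokenizeUdfSketch {
  // Same tokenize logic as in the question.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tokenize-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the real df_article.
    val df_article = Seq("A new relictual, highly-derived species!").toDF("title")

    // register returns a UserDefinedFunction and also exposes "tk" to SQL.
    val tk = spark.udf.register("tk", tokenize _)

    // Column API usage, as in the answer above.
    df_article.withColumn("newTitle", tk($"title")).show(false)

    // SQL usage through a temporary view (the view name is an assumption).
    df_article.createOrReplaceTempView("articles")
    spark.sql("SELECT title, tk(title) AS newTitle FROM articles").show(false)

    spark.stop()
  }
}
```

One caveat: as written, the UDF throws a NullPointerException on a null title, so you may want to filter out or guard against nulls first.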

Answer 2

I wouldn't use a UDF here at all. You can easily compose the same function from built-in expressions, in a safe and efficient way:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, regexp_replace, split}

def tokenize(c: Column) = split(
  regexp_replace(lower(c), "[^a-zA-Z0-9\\s]", ""), "\\s+"
)

// The $"title" syntax requires import spark.implicits._ (or sqlContext.implicits._).
df_article.select(tokenize($"title"))

There are also ml.feature.Tokenizer and ml.feature.RegexTokenizer, which may be useful to you.
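
As a rough illustration of the RegexTokenizer route, here is a minimal sketch, assuming Spark 2.x or later with spark-mllib on the classpath. The one-row df_article, the output column name tokens and the object name RegexTokenizerSketch are made up for illustration:

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

object RegexTokenizerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("regex-tokenizer-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the real df_article.
    val df_article = Seq("A new relictual, highly-derived species!").toDF("title")

    // With the default gaps = true, the pattern describes the separators
    // between tokens. Lowercasing is enabled by default.
    val tokenizer = new RegexTokenizer()
      .setInputCol("title")
      .setOutputCol("tokens")
      .setPattern("[^a-zA-Z0-9]+")

    tokenizer.transform(df_article).show(false)

    spark.stop()
  }
}
```

This is not an exact equivalent of the tokenize function above: here punctuation acts as a separator, so "highly-derived" yields two tokens instead of one.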

Answer 3

Why a UDF? You can use the built-in functions.

Here is an example in pyspark:

from pyspark.sql.functions import regexp_replace, lower

df_article.withColumn("title_cleaned", lower((regexp_replace('title', '([^a-zA-Z0-9\&\b]+)', " "))))

See the built-in functions:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.first

