How to decode HTML entities in Spark-scala?

Question

I have Spark code that reads some data from a database. One of the columns, named "title" (string type), contains the following data.

+-------------------------------------------------+
|title                                            |
+-------------------------------------------------+
|Example sentence                                 |
|Read the &#8216;Book&#8217;                      | 
|&#8216;LOTR&#8217; Is A Great Book               |
+-------------------------------------------------+

I want to decode the HTML entities so that the column looks like this.

+-------------------------------------------+
|title                                      |
+-------------------------------------------+
|Example sentence                           |
|Read the ‘Book’                            |
|‘LOTR’ Is A Great Book                     |
+-------------------------------------------+

Node.js has a library, "html-entities", that does exactly what I want, but I can't find anything similar for Spark-scala.

What would be a good way to do this?

Answer 1

You can do this with org.apache.commons.lang.StringEscapeUtils, wrapped in a UDF.

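import org.apache.commons.lang.StringEscapeUtils
import org.apache.spark.sql.functions.{col, udf}

// Decode HTML entities (e.g. &#8216;) into the characters they represent
val decodeHtml = (html: String) => StringEscapeUtils.unescapeHtml(html)

// Wrap the function as a UDF so it can be applied to a DataFrame column
val decodeHtmlUDF = udf(decodeHtml)

// df is the DataFrame read from the database
df.withColumn("title", decodeHtmlUDF(col("title"))).show()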

/*
+--------------------+
|               title|
+--------------------+
|   Example sentence |
|    Read the ‘Book’ |
|‘LOTR’ Is A Great...|
+--------------------+
*/
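
The entities in the question are all numeric character references (&#8216; and &#8217;). If pulling in a Commons dependency is not an option, a minimal sketch in plain Scala could decode just that form. The object name NumericEntityDecoder and its decode method are hypothetical names chosen for illustration, not part of any library, and named entities such as &amp; are deliberately left untouched.

```scala
import scala.util.matching.Regex

// Hypothetical helper (not a library API): decodes only numeric character
// references, decimal (&#8216;) or hexadecimal (&#x2018;).
object NumericEntityDecoder {
  // Group 1 captures hex digits, group 2 captures decimal digits.
  private val NumericEntity: Regex =
    """&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));""".r

  def decode(s: String): String =
    if (s == null) null // pass nulls through, as a nullable column may contain them
    else NumericEntity.replaceAllIn(s, m => {
      val codePoint =
        if (m.group(1) != null) Integer.parseInt(m.group(1), 16)
        else Integer.parseInt(m.group(2), 10)
      // Leave the reference as-is when the number is not a valid code point.
      val text =
        if (Character.isValidCodePoint(codePoint)) new String(Character.toChars(codePoint))
        else m.matched
      // Escape '$' and '\' so the result is inserted literally.
      Regex.quoteReplacement(text)
    })
}
```

Assuming the same df as above, the function could be wrapped the same way, for example with udf(NumericEntityDecoder.decode _). For full coverage of named entities, a library decoder such as StringEscapeUtils remains the simpler choice.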

scala

html-entities

apache-spark

apache-spark-sql