Apache Spark ML Pipeline: filter empty rows in dataset

In my Spark ML pipeline (Spark 2.3.0) I use a RegexTokenizer like this:

val regexTokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setMinTokenLength(3)

It transforms the DataFrame into one with an array of words, for example:

text      | words
-------------------------
a the     | [the]
a of to   | []
big small | [big,small]

How can I filter out the rows whose words array is empty ([])? Should I create a custom transformer and pass it to the pipeline?

You can filter directly with the Dataset API:

import org.apache.spark.sql.functions.size

df
  .select($"text", $"words")
  .where(size($"words") > 0)
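For completeness, the custom-Transformer route the question asks about could be sketched like this. This is a hypothetical minimal class (the name and uid are made up, not from the original answer), assuming the input column is hard-coded as "words":

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, size}
import org.apache.spark.sql.types.StructType

// Sketch of a custom stage that drops rows with an empty "words" array.
class EmptyRowRemover(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("emptyRowRemover"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF.where(size(col("words")) > 0)

  // The row filter does not change the schema.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): EmptyRowRemover = defaultCopy(extra)
}
```

In practice, though, you rarely need to write this yourself: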

You can use SQLTransformer:

import org.apache.spark.ml.feature.SQLTransformer

val emptyRemover = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE size(words) > 0"
)

which can be applied directly:

val df = Seq(
  ("a the", Seq("the")), ("a of the", Seq[String]()), 
  ("big small", Seq("big", "small"))
).toDF("text", "words")

emptyRemover.transform(df).show
+---------+------------+
|     text|       words|
+---------+------------+
|    a the|       [the]|
|big small|[big, small]|
+---------+------------+

or used in a Pipeline.
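As a sketch of the Pipeline usage (reusing regexTokenizer and emptyRemover defined above):

```scala
import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline()
  .setStages(Array(regexTokenizer, emptyRemover))

// SQLTransformer has no fittable state, but Pipeline.fit still
// returns a PipelineModel that applies all stages in order.
val model = pipeline.fit(df)
model.transform(df).show
```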

That being said, I would think twice before using this in a Spark ML workflow. Tools used downstream, like CountVectorizer, handle empty input just fine:

import org.apache.spark.ml.feature.CountVectorizer

val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")

vectorizer.fit(df).transform(df).show
+---------+------------+-------------------+                 
|     text|       words|           features|
+---------+------------+-------------------+
|    a the|       [the]|      (3,[2],[1.0])|
| a of the|          []|          (3,[],[])|
|big small|[big, small]|(3,[0,1],[1.0,1.0])|
+---------+------------+-------------------+

And the absence of certain words can often carry useful information.