Run a Regex on a dataframe and store the results in a new dataframe

Question

I have the following dataframe:

+---------------------------------+
|value                            |
+---------------------------------+
|I am going to school             |
|why are you crying               |
|You are not very good my friend  |
+---------------------------------+

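I want to filter the rows that contain emojis and put them into a new dataframe. I wrote the following code, which converts the dataframe to a list and then iterates over the list to identify the sentences with emojis. But I don't know how to apply these regexes directly on the dataframe.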

Existing code

import org.apache.spark.sql.DataFrame

// Collect the "value" column to the driver as a list of strings
def convertDataFrameToList(combinedDataFrame: DataFrame): List[String] =
  combinedDataFrame.select("value").rdd.map(r => r.getString(0)).collect.toList

val listOutput = convertDataFrameToList(myDaframe)
for (element <- listOutput) {
  val emojiValues = raw"\p{block=Emoticons}".r.findAllIn(element).toSeq
  val y = raw"\p{block=Miscellaneous Symbols and Pictographs}".r.findAllIn(element).toSeq
  val p = emojiValues ++ y

  // process further
}

Update

I tried the following regex:

 val emoticonResult = myKafkaDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uuD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")

As a result, rows with emojis are returned, but so are rows without any emojis. What is wrong with my code?

Answer 1

You can use regexp_extract with your regex:

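import org.apache.spark.sql.functions.regexp_extract  // $"value" also needs spark.implicits._ (spark = your SparkSession)

val emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) =!= "")
val no_emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) === "")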

emojis.show(false)
+--------------------------+
|value                     |
+--------------------------+
|I am going to school    |
|why are you crying    |
+--------------------------+

no_emojis.show(false)
+-------------------------------+
|value                          |
+-------------------------------+
|You are not very good my friend|
+-------------------------------+
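
If you want to cover more than the Emoticons block, as in the Update to the question, one point worth knowing is that a comma inside a character class is matched literally, so the blocks are usually just written next to each other. Below is a minimal sketch in plain Scala (no Spark needed) for checking such a pattern on its own before using it in a filter. The names EmojiPatternSketch, emojiPattern and containsEmoji are made up for illustration, and writing the surrogate-pair range from the question as \x{1F900}-\x{1F9FF} is my assumption about what that range was meant to cover.

```scala
import java.util.regex.Pattern

object EmojiPatternSketch {
  // Blocks are listed back to back inside the class; a ',' here would match a literal comma.
  // \x{1F900}-\x{1F9FF} is the code point range encoded by the surrogate pairs
  // \uD83E\uDD00-\uD83E\uDDFF (assumed to be what the question intended).
  val emojiPattern: String =
    raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]"

  private val compiled: Pattern = Pattern.compile(emojiPattern)

  // True if at least one character of `text` falls in one of the ranges above
  def containsEmoji(text: String): Boolean = compiled.matcher(text).find()

  // Build a one-character string from a code point, to avoid pasting emojis into the source
  private def fromCodePoint(cp: Int): String = new String(Character.toChars(cp))

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "I am going to school " + fromCodePoint(0x1F600), // U+1F600 is in the Emoticons block
      "why are you crying " + fromCodePoint(0x1F62D),   // U+1F62D is in the Emoticons block
      "You are not very good my friend"                 // no emoji
    )
    samples.foreach(s => println(s"${containsEmoji(s)}: $s"))
  }
}
```

As far as I know, Spark's rlike uses the same Java regex engine, so the pattern string should be usable as df.filter($"value".rlike(EmojiPatternSketch.emojiPattern)), and negated with ! for the rows without emojis. I have not run that part against your data, so treat it as a starting point.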
