Run a Regex on a dataframe and store the results in a new dataframe
I have the following dataframe:
+----------------------------------+
| value                            |
+----------------------------------+
| I am going to school             |
| why are you crying               |
| You are not very good my friend  |
+----------------------------------+
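I want to filter the rows that contain emojis and put them into a new dataframe. I wrote the following code, which converts the dataframe to a list and then iterates over the list to identify sentences with emojis. However, I don't know how to apply these regexes on the dataframe itself.
Existing code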
import org.apache.spark.sql.DataFrame

// Collect the "value" column to the driver as a list of strings
// (String rather than Any, so the regex calls below compile)
def convertDataFrameToList(combinedDataFrame: DataFrame): List[String] =
  combinedDataFrame.select("value").rdd.map(r => r.getString(0)).collect.toList

val listOutput = convertDataFrameToList(myDaframe)
for (element <- listOutput) {
  // Characters from each Unicode block found in this sentence
  val emojiValues = raw"\p{block=Emoticons}".r.findAllIn(element).toSeq
  val y = raw"\p{block=Miscellaneous Symbols and Pictographs}".r.findAllIn(element).toSeq
  val p = emojiValues ++ y
  // process further
}
Update
I tried the following regex:
val emoticonResult = myKafkaDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uuD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")
As a result, rows with emojis are returned, but so are rows without any emojis. What is wrong with my code?
You can use regexp_extract with your regex:
val emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) =!= "")
val no_emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) === "")
emojis.show(false)
+--------------------------+
|value |
+--------------------------+
|I am going to school |
|why are you crying |
+--------------------------+
no_emojis.show(false)
+-------------------------------+
|value |
+-------------------------------+
|You are not very good my friend|
+-------------------------------+
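To cover more than one Unicode block, it may help to check the combined pattern in plain Scala before handing it to Spark. The sketch below is only an illustration: the object name EmojiPatternCheck and its helpers containsEmoji and cp are hypothetical, and the range \x{1F900}-\x{1F9FF} is an assumed stand-in for the surrogate-pair range in the update. One thing it demonstrates is that a comma inside a character class is a literal comma, not a separator, so the items should simply be written next to each other.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, not part of the original question or answer.
object EmojiPatternCheck {
  // One character class covering several blocks. Items are listed
  // back to back: a comma here would match a literal ',' in the text.
  val emojiClass: String =
    raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]"

  private val emojiPattern: Pattern = Pattern.compile(emojiClass)

  // True if the string contains at least one character from the class
  def containsEmoji(s: String): Boolean = emojiPattern.matcher(s).find()

  // Build a one-character string from a Unicode code point
  def cp(codePoint: Int): String = new String(Character.toChars(codePoint))

  // Same idea as the comma-separated class in the update, reduced to
  // one block: it also matches any text that merely contains a comma.
  private val withComma: Pattern = Pattern.compile(raw"[\p{block=Emoticons},]")
  def matchesWithComma(s: String): Boolean = withComma.matcher(s).find()
}
```

Assuming the column holds plain strings, the same class can then be passed to the dataframe filter, for example df.filter(col("value").rlike(EmojiPatternCheck.emojiClass)), which avoids comparing the extracted group against an empty string.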