How do I replace a delimiter only where it appears between other content?
I have a use case with this data:
1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"
Currently I replace every + with a space, like this:
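import org.apache.spark.sql.functions.udf
// Assumed definition; the original snippet does not show this constant.
val BLANK_SPACE = " "
def normalizer(value: String): String = {
  if (value == null) {
    null
  } else {
    // "\\+" is the regex for a literal plus sign
    value.replaceAll("\\+", BLANK_SPACE)
  }
}
val testUDF = udf(normalizer(_: String): String)
df.withColumn("newCol", testUDF($"value"))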
But this replaces every "+". How can I replace only the "+" that sits between strings, while also handling this case: "apple+case+10++cover" => "apple case 10+ cover"?
The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"
You can try two regex replacements:
df.withColumn("newCol", regexp_replace(
  regexp_replace($"value", "(?<=\\d)\\+(?!\\+)", "+ "),
  "(?<!\\d)\\+", " ")).show
The inner replacement targets the edge case of a single plus sign preceded by a digit: it keeps the plus and adds a space after it. Example:
apple+case+10+cover --> apple+case+10+ cover
The outer replacement then targets every plus sign that is not preceded by a digit and replaces it with a space. Continuing the example above:
apple+case+10+ cover --> apple case 10+ cover
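To see what these two patterns do on each sample, here is a minimal sketch in plain Scala. It assumes String.replaceAll behaves like Spark's regexp_replace for these patterns (both use Java regex syntax). The names TwoStepDemo and twoStep are hypothetical and not part of the answer.

```scala
object TwoStepDemo {
  // Hypothetical helper mirroring the nested regexp_replace calls above.
  def twoStep(s: String): String =
    s.replaceAll("(?<=\\d)\\+(?!\\+)", "+ ") // inner: single plus after a digit -> keep plus, add space
      .replaceAll("(?<!\\d)\\+", " ")        // outer: plus not preceded by a digit -> space

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "apple+case", "apple+case+10+cover", "apple+case+10++cover", "+apple", "iphone8+")
    // Brackets make leading and trailing whitespace visible.
    samples.foreach(s => println(s"[$s] -> [${twoStep(s)}]"))
  }
}
```

Under that assumption, the third sample comes out as "apple case 10+ cover". The second sample, however, also keeps its plus ("apple case 10+ cover"), "+apple" becomes " apple" with a leading space, and "iphone8+" gains a trailing space, so it is worth comparing the results with the expected output before adopting this variant.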
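You can do this with regexp_replace instead of a udf, which should be much faster. For most cases a negative lookahead in the regex is enough, but for "+apple" you actually want to replace the "+" with "" (not a space). The simplest way to handle that is to use two regular expressions: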
df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
  .withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))
This gives:
+--------------------+--------------------+
|value               |newCol              |
+--------------------+--------------------+
|apple+case          |apple case          |
|apple+case+10+cover |apple case 10 cover |
|apple+case+10++cover|apple case 10+ cover|
|+apple              |apple               |
|iphone8+            |iphone8+            |
+--------------------+--------------------+
To make this more modular and reusable, you can define it as a function:
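// Strip a leading "+", then replace each "+" not followed by "+" or end of string with a space.
def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")
df.withColumn("newCol", normalizer("value"))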
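The same two patterns can be sanity-checked without a Spark session. This is a minimal sketch in plain Scala, assuming String.replaceAll matches Spark's regexp_replace for these patterns (both use Java regex syntax). NormalizerCheck, normalizeString, and the Option wrapper for null handling are illustrative and not part of the answer.

```scala
object NormalizerCheck {
  // Hypothetical plain-String counterpart of the Column-based normalizer.
  def normalizeString(value: String): Option[String] =
    Option(value).map { v =>
      v.replaceAll("^\\+", "")            // drop a leading plus
        .replaceAll("\\+(?!\\+|$)", " ")  // plus not followed by a plus or end of string -> space
    }

  def main(args: Array[String]): Unit = {
    // Inputs and expected outputs taken from the question.
    val expected = Seq(
      "apple+case" -> "apple case",
      "apple+case+10+cover" -> "apple case 10 cover",
      "apple+case+10++cover" -> "apple case 10+ cover",
      "+apple" -> "apple",
      "iphone8+" -> "iphone8+"
    )
    expected.foreach { case (in, out) =>
      assert(normalizeString(in).contains(out), s"unexpected result for $in")
    }
    assert(normalizeString(null).isEmpty) // null input yields None instead of throwing
    println("all samples match")
  }
}
```

In the Column-based version no explicit null check is needed, since regexp_replace should return null for a null input.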