How do I replace a delimiter that appears only in between something?

I have a use case with the following data:

1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"

Currently, I replace every + with a space like this:

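// BLANK_SPACE is not defined in the post; assumed here to be a single space
val BLANK_SPACE = " "

def normalizer(value: String): String = {
  if (value == null) {
    null
  } else {
    // "+" is a regex metacharacter, so it must be escaped
    value.replaceAll("\\+", BLANK_SPACE)
  }
}

val testUDF = udf(normalizer(_: String): String)

df.withColumn("newCol", testUDF($"value"))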

But this replaces every "+". How can I replace only the "+" that sits between strings, while also handling this case: "apple+case+10++cover" => "apple case 10+ cover"?

The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"

You can try making two regex replacements:

df.withColumn("newCol", regexp_replace(
    regexp_replace($"value", "(?<=\\d)\\+(?!\\+)", "+ "),
    "(?<!\\d)\\+", " ")).show

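The inner regex replacement targets the edge case of a single plus sign preceded by a digit, which is handled by appending a space (without removing the plus sign). Example: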

apple+case+10+cover  -->  apple+case+10+ cover

The outer regex replacement then targets every plus sign that is not preceded by a digit and replaces it with a space. Continuing the example above:

apple+case+10+ cover -->  apple case 10+ cover
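
To see what the two passes do on all five sample inputs from the question, here is a minimal sketch that applies the same patterns with plain String.replaceAll instead of Spark. It assumes that regexp_replace follows Java regex semantics; the object and method names are hypothetical and only serve the illustration.

```scala
object TwoPassCheck {
  // Hypothetical helper; mirrors the nested regexp_replace calls on a plain String.
  def twoPass(s: String): String =
    s.replaceAll("(?<=\\d)\\+(?!\\+)", "+ ") // inner pass: single plus after a digit -> "+ "
      .replaceAll("(?<!\\d)\\+", " ")        // outer pass: plus not preceded by a digit -> " "

  // Sample inputs taken from the question.
  val samples: Seq[String] =
    Seq("apple+case", "apple+case+10+cover", "apple+case+10++cover", "+apple", "iphone8+")

  def main(args: Array[String]): Unit =
    samples.foreach(s => println(s"[$s] -> [${twoPass(s)}]"))
}
```

Under these assumptions, "apple+case+10++cover" becomes "apple case 10+ cover" as requested, but "apple+case+10+cover" also keeps its plus ("apple case 10+ cover"), "+apple" gains a leading space, and "iphone8+" gains a trailing space, so an extra trim or a further pass would be needed to match the output listed in the question exactly.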

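You can do this with regexp_replace instead of a udf, which should be much faster. For most of the cases a negative lookahead in the regex is enough, but for "+apple" you actually want to replace the "+" with "" (not a space). The simplest way is to use two regexp_replace calls.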

df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
  .withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))

This will give:

+--------------------+--------------------+
|value               |newCol              |
+--------------------+--------------------+
|apple+case          |apple case          |
|apple+case+10+cover |apple case 10 cover |
|apple+case+10++cover|apple case 10+ cover|
|+apple              |apple               |
|iphone8+            |iphone8+            |
+--------------------+--------------------+
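
The table can be double-checked without a Spark session. The sketch below applies the same two patterns with plain String.replaceAll, assuming regexp_replace follows Java regex semantics; the object and method names are hypothetical.

```scala
object PlusNormalizerCheck {
  // Hypothetical helper; same two patterns as the regexp_replace calls above.
  def normalize(s: String): String =
    s.replaceAll("^\\+", "")            // drop a leading plus entirely
      .replaceAll("\\+(?!\\+|$)", " ")  // plus not followed by "+" or end of string -> space

  // Sample inputs taken from the question.
  val samples: Seq[String] =
    Seq("apple+case", "apple+case+10+cover", "apple+case+10++cover", "+apple", "iphone8+")

  def main(args: Array[String]): Unit =
    samples.foreach(s => println(s"[$s] -> [${normalize(s)}]"))
}
```

The negative lookahead is what keeps the first plus in "10++cover" (it is followed by another plus) and the trailing plus in "iphone8+" (it is followed by the end of the string).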

To make it more modular and reusable, you can define it as a function:

// Strip a leading "+", then turn every "+" not followed by "+" or end of string into a space.
def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")

df.withColumn("newCol", normalizer("value"))