Split records with complex delimiter

Question

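I have an incoming record with a complex column delimiter and need to tokenize it. One of the delimiter's characters can also be part of the data.

I'm looking for a regular expression. It has to work with the function "REGEXP_SUBSTR" on Teradata 16.1.

There can be at most 5 columns to tokenize. I plan to use a CASE statement in Teradata to tokenize the columns. I think a regex that extracts a single token will solve the problem.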

Case#1: Column delimiter is ' - ' 
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]

My attempt: ([\S]*)\s{1}\-{1}\s{1}

Case#2 : Column delimiter is '::' 
Sample data : On:e::tw:o::thr$ee 
Required output : [On:e, tw:o, thr$ee]

Case#3 : Column delimiter is ':;' 
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]

The three cases above are independent and never occur together, i.e. three separate solutions are needed.

Answer 1

If you absolutely must use a regex for this, you can use capture groups as in the examples below.

General example:

/(?<data>.+?)($delimiter|$)/gm

(?<data>.+?) is a named capture group, data, that matches:
. any character
+? one to unlimited times, as few times as possible (lazy)

followed by

($delimiter|$) another capture group that matches:
$delimiter - replace this with a regex that matches your delimiter string
| or
$ end of string

Applied to your examples:

Case #1:

Column delimiter is ' - '

/(?<data>.+?)(\s-\s|$)/gm

https://regex101.com/r/qMYxAY/1

Case #2:

Column delimiter is '::'

/(?<data>.+?)(\:\:|$)/gm

https://regex101.com/r/IzaAoA/1

Case #3:

Column delimiter is ':;'

/(?<data>.+?)(\:\;|$)/gm

https://regex101.com/r/g1MUb6/1
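
The general pattern above can be checked against the three sample records with a minimal Python sketch. The helper name split_fields is hypothetical. Python writes the named group as (?P<data>...) rather than (?<data>...), and re.escape is used here to make the delimiter literal, so case 1 matches literal spaces instead of \s. This only illustrates the matching logic; whether Teradata's REGEXP_SUBSTR accepts the same syntax is not verified here.

```python
import re


def split_fields(record, delimiter):
    """Return the fields of record that are separated by a multi-character delimiter.

    Hypothetical helper: a lazy data group followed by the delimiter or end of string.
    """
    # re.escape turns the delimiter into a literal pattern (e.g. ':;' stays ':;').
    pattern = re.compile(r"(?P<data>.+?)(?:" + re.escape(delimiter) + r"|$)")
    # Each match consumes one field plus its trailing delimiter (if any).
    return [m.group("data") for m in pattern.finditer(record)]


print(split_fields("On-e - tw o - thr$ee", " - "))  # ['On-e', 'tw o', 'thr$ee']
print(split_fields("On:e::tw:o::thr$ee", "::"))     # ['On:e', 'tw:o', 'thr$ee']
print(split_fields("On:e:;tw;o:;thr$ee", ":;"))     # ['On:e', 'tw;o', 'thr$ee']
```

Because .+? needs at least one character, the pattern assumes that every field is non-empty.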

Answer 2

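Normally you would use STRTOK to split a string on a delimiter, but STRTOK cannot handle multi-character delimiters: it treats each character of its delimiter argument as a separate delimiter. A moderately over-complicated workaround is to replace the multi-character delimiter with a single character and split on that. For example: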

-- '|' is the stand-in delimiter; it must not occur in the data itself
select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(<your column>,' - ', '|'),'|',2) as two,
strtok(oreplace(<your column>,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five
from <your table>;

If there are only three tokens, as in your example, the other two calls return NULL.
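
The replace-then-split idea can also be tried outside the database with a rough Python analogue. The names nth_token and placeholder are hypothetical; str.replace and str.split merely stand in for oreplace and strtok and do not model how Teradata handles empty fields. The guard against the placeholder already occurring in the data is an added assumption, not part of the SQL above.

```python
def nth_token(record, delimiter, n, placeholder="|"):
    """Return the n-th (1-based) field of record, or None if there is no such field."""
    # The stand-in character must not appear in the data, or fields would be split wrongly.
    if placeholder in record:
        raise ValueError("placeholder occurs in the data; pick another character")
    # Step 1: collapse the multi-character delimiter into a single character.
    # Step 2: split on that single character.
    tokens = record.replace(delimiter, placeholder).split(placeholder)
    return tokens[n - 1] if 1 <= n <= len(tokens) else None


sample = "On-e - tw o - thr$ee"
columns = [nth_token(sample, " - ", n) for n in range(1, 6)]
print(columns)  # ['On-e', 'tw o', 'thr$ee', None, None]
```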

regex

teradata