How to split data from two columns using two different delimiters in Redshift?

I have a CTE with data like this. It mostly follows two formats, where counts and process_ids will have these two types of data.

client_id      day              counts      process_ids
--------------------------------------------------------------------------------------------
abc1          Feb-01-2021        3        C1,C2 | C3,C4,C5 | C6,C7
abc2          Feb-05-2021       2, 3      C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

Now I want to get the output below from the above CTE, after splitting on counts and process_ids -
client_id      day              counts      process_ids
--------------------------------------------------------
abc1           Feb-01-2021        3           C1
abc1           Feb-01-2021        3           C2
abc1           Feb-01-2021        3           C3
abc1           Feb-01-2021        3           C4
abc1           Feb-01-2021        3           C5
abc1           Feb-01-2021        3           C6
abc1           Feb-01-2021        3           C7
abc2           Feb-05-2021        2           C10
abc2           Feb-05-2021        2           C11
abc2           Feb-05-2021        2           C12
abc2           Feb-05-2021        2           C13
abc2           Feb-05-2021        2           C14
abc2           Feb-05-2021        3           C15
abc2           Feb-05-2021        3           C16
abc2           Feb-05-2021        3           C17
abc2           Feb-05-2021        3           C18

Basically, the idea is to split counts and process_ids based on the following two use cases, if they follow either of these formats.

Use case 1

If the counts column has only one number and the process_ids column has the | delimiter.

Use case 2

If the counts column has two numbers separated by the , delimiter, and the process_ids column has the # delimiter along with pipes.

I am working with Amazon Redshift here, and I am confused about how to split them as required.

Is it possible to do this?
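To make the expected pairing explicit, here is a sketch of the logic in Python (a hypothetical illustration of the desired output, not Redshift code; split_row is a made-up name):

```python
def split_row(client_id, day, counts, process_ids):
    """Pair each '#'-separated segment of process_ids with the matching
    ','-separated value of counts, then explode the segment into ids.
    With a single count and no '#', the whole string is one segment."""
    rows = []
    count_parts = [c.strip() for c in counts.split(",")]
    segments = [s.strip() for s in process_ids.split("#")]
    for count, segment in zip(count_parts, segments):
        # Within a segment, '|' and ',' both separate individual ids.
        for group in segment.split("|"):
            for pid in group.split(","):
                rows.append((client_id, day, count, pid.strip()))
    return rows

# Use case 1: single count, '|' delimiters only
print(split_row("abc1", "Feb-01-2021", "3", "C1,C2 | C3,C4,C5 | C6,C7"))
# Use case 2: two counts, '#' splits the segments
print(split_row("abc2", "Feb-05-2021", "2, 3",
                "C10,C11,C12 | C13,C14 # C15,C16 | C17,C18"))
```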

It might look a bit hairy at first glance, but it is built up from solid techniques and gives the expected results...

SQL

WITH seq_0_9 AS (
  -- Digits 0 through 9
  SELECT 0 AS d
  UNION ALL SELECT 1 AS d
  UNION ALL SELECT 2 AS d
  UNION ALL SELECT 3 AS d
  UNION ALL SELECT 4 AS d
  UNION ALL SELECT 5 AS d
  UNION ALL SELECT 6 AS d
  UNION ALL SELECT 7 AS d
  UNION ALL SELECT 8 AS d
  UNION ALL SELECT 9 AS d
),
numbers AS (
  -- Cross join of three digit sets: the numbers 1 to 1000 inclusive
  SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
  FROM seq_0_9 a, seq_0_9 b, seq_0_9 c
),
processed AS
  -- Strip spaces and normalize '|' to ',' so '#' is the only
  -- segment-level delimiter left in process_ids
  (SELECT client_id,
          day,
          REPLACE(counts, ' ', '') AS counts,
          REPLACE(REPLACE(process_ids, ' ', ''), '|', ',') AS process_ids
   FROM tbl),
split_pids AS
  -- One row per '#'-separated segment of process_ids
  (SELECT
     client_id, 
     day,
     counts,
     split_part(process_ids, '#', n) AS process_ids,
     n AS n1
   FROM processed
   CROSS JOIN numbers
   WHERE 
     split_part(process_ids, '#', n) IS NOT NULL
     AND split_part(process_ids, '#', n) != ''),
split_counts AS
  -- One row per ','-separated value of counts
  (SELECT
     client_id, 
     day,
     split_part(counts, ',', n) AS counts,
     process_ids,
     n1,
     n AS n2
   FROM split_pids
   CROSS JOIN numbers
   WHERE
     split_part(counts, ',', n) IS NOT NULL
     and split_part(counts, ',', n) != ''),
matched_up AS
  -- Keep only rows where the count position matches the segment position
  (SELECT * FROM split_counts WHERE n1 = n2)
-- Finally, explode each matched segment on ','
SELECT
  client_id, 
  day,
  counts,
  split_part(process_ids, ',', n) AS process_ids
FROM
  matched_up
CROSS JOIN
  numbers
WHERE
  split_part(process_ids, ',', n) IS NOT NULL
  AND split_part(process_ids, ',', n) != '';
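The != '' filters above rely on how Redshift's split_part(string, delimiter, part) behaves: it is 1-indexed and returns an empty string when the part number exceeds the number of parts, so the cross join with 1000 numbers produces only as many rows as there are actual parts. A minimal Python equivalent of that behavior:

```python
def split_part(s, delim, n):
    # Redshift-style SPLIT_PART: 1-based part index; returns an empty
    # string when n exceeds the number of parts.
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

print(split_part("C1,C2,C3", ",", 2))   # the 2nd part
print(repr(split_part("C1,C2,C3", ",", 5)))  # out of range: empty string
```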

Demo

Online rextester demo (uses PostgreSQL, but should be compatible with Redshift): https://rextester.com/FNA16497

Brief explanation

This technique is used to generate a numbers table (from 1 to 1000 inclusive). It is then used multiple times, with multiple Common Table Expressions, to achieve this in a single SQL statement.
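As a quick sanity check on the numbers CTE, the same cross-join arithmetic (a + 10*b + 100*c + 1 over three 0-9 digit sets) can be reproduced outside SQL:

```python
# Mirror the numbers CTE: cross join three digit sets 0-9 and
# combine them as ones, tens, and hundreds, offset by 1.
digits = range(10)
numbers = sorted(a + b * 10 + c * 100 + 1
                 for a in digits for b in digits for c in digits)
print(numbers[0], numbers[-1], len(numbers))
```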

I have built an example script, starting from this TSV

client_id   day counts  process_ids
abc1    Feb-01-2021 3   C1,C2 | C3,C4,C5 | C6,C7
abc2    Feb-05-2021 2,3 C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

Here is the pretty-printed version

+-----------+-------------+--------+-------------------------------------------+
| client_id | day         | counts | process_ids                               |
+-----------+-------------+--------+-------------------------------------------+
| abc1      | Feb-01-2021 | 3      | C1,C2 | C3,C4,C5 | C6,C7                  |
| abc2      | Feb-05-2021 | 2,3    | C10,C11,C12 | C13,C14 # C15,C16 | C17,C18 |
+-----------+-------------+--------+-------------------------------------------+

I wrote this Miller program

mlr --tsv clean-whitespace then put -S '
  if ($process_ids=~"|" && $counts=~"^[0-9]$")
    {$process_ids=gsub($process_ids," *[|] *",",")}
  elif($process_ids=~"[#]")
    {$process_ids=gsub(gsub($process_ids," *[|] *",",")," *# *","#");$counts=gsub($counts,",","#")}'  then \
put '
  asplits = splitnv($counts, "#");
  bsplits = splitnv($process_ids, "#");
  n = length(asplits);
  for (int i = 1; i <= n; i += 1) {
    outrec = $*;
    outrec["counts"] = asplits[i];
    outrec["process_ids"] = bsplits[i];
    emit outrec;
  }
' then \
uniq -a then \
filter -x -S '$counts=~"[#]"' then \
cat -n then \
nest --explode --values --across-records -f process_ids --nested-fs "," then \
cut -x -f n input.tsv

which gives you

client_id       day     counts  process_ids
abc1    Feb-01-2021     3       C1
abc1    Feb-01-2021     3       C2
abc1    Feb-01-2021     3       C3
abc1    Feb-01-2021     3       C4
abc1    Feb-01-2021     3       C5
abc1    Feb-01-2021     3       C6
abc1    Feb-01-2021     3       C7
abc2    Feb-05-2021     2       C10
abc2    Feb-05-2021     2       C11
abc2    Feb-05-2021     2       C12
abc2    Feb-05-2021     2       C13
abc2    Feb-05-2021     2       C14
abc2    Feb-05-2021     3       C15
abc2    Feb-05-2021     3       C16
abc2    Feb-05-2021     3       C17
abc2    Feb-05-2021     3       C18