Find the minimum non-repeating sequence using Oracle SQL

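Consider a sequence made up of the letters a, g, c and t. You have to find the shortest non-repeating character sequences and their length. Note that the non-repeating characters must be consecutive, in the order they appear in the sequence.

For example, in the sequence 'aaggcct' the answer is t, with length 1. Even though aa, gg, cc, ag, gc and ct are also non-repeating, t is non-repeating at the shorter length 1, so the answer is t. When I say t is non-repeating, I mean there is no other t in the sequence.

For the sequence 'aaggcctt', one of the answers is aa, which is among the shortest non-repeating character sets. Even though aag is non-repeating, the minimum non-repeating length is 2, so it is not considered. When I say 'aa' is non-repeating, I mean there is no other aa in the sequence. The complete answer is below.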

    DATA    LENGTH
    ag       2
    gg       2
    ct       2
    cc       2
    aa       2
    tt       2
    gc       2

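Another example is the sequence 'aaagggcccttt'. Here aa repeats, so it is not in the answer. When I say 'aa' repeats, I mean that 'aaa' contains two occurrences of 'aa': one starting at position 1 and another starting at position 2.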

    DATA    LENGTH
    ag       2
    ct       2
    gc       2
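The "repeating" rule above counts overlapping occurrences. As a minimal sketch of that rule outside the database (the helper name `count_overlapping` is made up for illustration and is not part of the question), the check could look like this in Python:

```python
def count_overlapping(sequence: str, candidate: str) -> int:
    """Count occurrences of candidate in sequence, allowing overlaps."""
    width = len(candidate)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == candidate
    )


print(count_overlapping("aaagggcccttt", "aa"))  # 2: positions 1 and 2 (1-based)
print(count_overlapping("aaggcctt", "aa"))      # 1: 'aa' is non-repeating here
print("aaagggcccttt".count("aa"))               # 1: str.count skips overlaps
```

The last line shows why a plain non-overlapping count is not enough for this problem: it would report 'aa' as non-repeating in 'aaagggcccttt'.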

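Find all the substrings of each string, count how many times each one occurs and exclude any that are not unique, then find the set of remaining substrings with the minimum length for each string.

So, if you have the test data: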

    CREATE TABLE test_data ( id, value ) AS
      SELECT 1, 'agaga' FROM DUAL UNION ALL
      SELECT 2, 'aaggcct' FROM DUAL UNION ALL
      SELECT 3, 'aaggcctt' FROM DUAL UNION ALL
      SELECT 4, 'aaagggcccttt' FROM DUAL;

then you can use:

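    WITH substrings ( id, value, length, pos ) AS (
      -- Anchor member: the whole string, starting at position 1.
      SELECT id,
             value,
             LENGTH( value ),
             1
      FROM   test_data
    UNION ALL
      -- Recursive member: slide left at the same length; once pos reaches 1,
      -- drop to the next shorter length and restart at its last valid position.
      SELECT id,
             value,
             CASE pos
             WHEN 1
             THEN length - 1
             ELSE length
             END,
             CASE pos
             WHEN 1
             THEN LENGTH(value) - (length-2)
             ELSE pos-1
             END
      FROM   substrings
      WHERE  length > 1
      OR     pos > 1
    ),
    non_repeats ( id, value, substring ) AS (
      -- Keep only the substrings that occur exactly once within their string.
      SELECT id,
             MIN( value ),
             SUBSTR( value, pos, length )
      FROM   substrings s
      GROUP BY id, SUBSTR( value, pos, length )
      HAVING COUNT(*) = 1
    )
    -- Return, per id, the unique substrings of minimal length.
    SELECT id,
           value,
           substring
    FROM   (
      SELECT id,
             value,
             substring,
             RANK() OVER ( PARTITION BY id ORDER BY LENGTH( substring ) ASC ) AS rnk
      FROM   non_repeats
    )
    WHERE  rnk = 1;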
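To see what the recursive `substrings` subquery enumerates, here is a small Python sketch that mirrors its recurrence. It is only an illustration, not part of the answer's SQL, and the function name `walk_substrings` is made up. It yields the same (length, pos) pairs in the same order, with 1-based positions as in Oracle.

```python
def walk_substrings(value: str):
    """Yield (length, pos) pairs in the order the recursive query visits them."""
    length, pos = len(value), 1          # anchor member: the whole string
    yield length, pos
    while length > 1 or pos > 1:         # same stop condition as the WHERE clause
        if pos == 1:
            # Next shorter length, starting at its last valid position.
            length, pos = length - 1, len(value) - (length - 2)
        else:
            pos -= 1                     # same length, one position to the left
        yield length, pos


for length, pos in walk_substrings("aga"):
    # pos is 1-based, so shift by one for Python slicing.
    print(length, pos, "aga"[pos - 1:pos - 1 + length])
```

For a string of length n this visits each of the n(n+1)/2 substrings exactly once, which is why grouping by the substring text and keeping `COUNT(*) = 1` identifies the non-repeating ones.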

This outputs:

ID | VALUE        | SUBSTRING
-: | :----------- | :--------
 1 | agaga        | gag      
 2 | aaggcct      | t        
 3 | aaggcctt     | gc       
 3 | aaggcctt     | cc       
 3 | aaggcctt     | ct       
 3 | aaggcctt     | ag       
 3 | aaggcctt     | aa       
 3 | aaggcctt     | tt       
 3 | aaggcctt     | gg       
 4 | aaagggcccttt | ct       
 4 | aaagggcccttt | gc       
 4 | aaagggcccttt | ag       

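The expected rows can be cross-checked outside the database with a brute-force sketch in Python. This is a reference check under the problem statement's definitions, not a translation of the SQL; the function name `shortest_unique_substrings` is made up for illustration. Unlike the query, it stops at the first length that has any unique substring instead of ranking all lengths.

```python
from collections import Counter


def shortest_unique_substrings(value: str) -> list:
    """Return the unique substrings of minimal length, in order of appearance."""
    for length in range(1, len(value) + 1):
        # Count every substring of this length, overlaps included.
        counts = Counter(
            value[start:start + length]
            for start in range(len(value) - length + 1)
        )
        unique = [sub for sub, n in counts.items() if n == 1]
        if unique:
            return unique
    return []


for value in ("agaga", "aaggcct", "aaggcctt", "aaagggcccttt"):
    print(value, shortest_unique_substrings(value))
```

For the four test strings this gives `gag`, `t`, the seven two-letter substrings of 'aaggcctt', and `ag`, `gc`, `ct`, matching the query output up to row order.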

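I attempted an answer myself. First I find every non-repeating substring using two iterators (one for the start position and one for the length), then I keep the non-repeating substrings of minimum length.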

    -- Appending '$' makes a substring that ends at the last character come from
    -- exactly one (start, length) pair, because SUBSTR truncates at the end of the string.
    WITH dna
     AS (SELECT value
                ||'$' AS seq,
                id
         FROM   test_data),
     -- One row per position 1..LENGTH(seq) for each id.
     iterator
     AS (SELECT column_value n,
                id
         FROM   dna
                cross join TABLE(Cast(MULTISET (SELECT LEVEL
                                          FROM   dual
                                          CONNECT BY LEVEL <= Length(seq)) AS
                                       sys.ODCINUMBERLIST))),
     -- i1.n is the start position and i2.n the length; keep substrings produced once.
     target_data
     AS (SELECT Count(1)                             cnt,
                Substr(dna2.seq, i1.n, i2.n)         data1,
                Length(Substr(dna2.seq, i1.n, i2.n)) lngth,
                dna2.id,
                Replace(seq, '$', '')                value
         FROM   dna dna2,
                iterator i1,
                iterator i2
         WHERE  dna2.id = i1.id
                AND dna2.id = i2.id
         GROUP  BY Substr(dna2.seq, i1.n, i2.n),
                   Length(Substr(dna2.seq, i1.n, i2.n)),
                   dna2.id,
                   seq
         HAVING Count(1) = 1)
    -- Keep, per id, the substrings of minimal length.
    SELECT id,
           value,
           data1
    FROM   target_data td
    WHERE  lngth = (SELECT Min(lngth)
                    FROM   target_data td1
                    WHERE  td1.id = td.id)
    ORDER  BY id;

The output is:

    ID  VALUE          DATA1
    1   agaga           gag
    2   aaggcct         t
    3   aaggcctt        cc
    3   aaggcctt        gc
    3   aaggcctt        ag
    3   aaggcctt        ct
    3   aaggcctt        gg
    3   aaggcctt        aa
    3   aaggcctt        tt
    4   aaagggcccttt    ct
    4   aaagggcccttt    ag
    4   aaagggcccttt    gc
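The `'$'` appended in the `dna` subquery appears to be what keeps the `Count(1) = 1` test honest for substrings that end at the last character. The Python sketch below is a simulation, not Oracle code. It assumes Oracle's `SUBSTR` returns the rest of the string when the requested length runs past the end, which Python slicing also does, and the helper names `oracle_substr` and `pair_counts` are made up for illustration.

```python
from collections import Counter


def oracle_substr(text: str, start: int, length: int) -> str:
    """Mimic SUBSTR for a 1-based start >= 1 and length >= 1 (truncates at the end)."""
    return text[start - 1:start - 1 + length]


def pair_counts(seq: str) -> Counter:
    """Count how many (start, length) iterator pairs produce each substring."""
    positions = range(1, len(seq) + 1)
    return Counter(oracle_substr(seq, i, n) for i in positions for n in positions)


print(pair_counts("aaggcct")["t"])    # 7: every length from position 7 yields 't'
print(pair_counts("aaggcct$")["t"])   # 1: longer lengths now yield 't$' instead
```

Without the sentinel, the trailing 't' of 'aaggcct' would be produced by seven different pairs and wrongly discarded as repeating. This relies on `'$'` never occurring in the data itself.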