保持唯一的字符串的缩写

Abbreviation of Strings that Remains Unique

我有一个独特的字符串列表(最初的想法是 table 中的列名)。 任务是执行列表的最大可能缩写,因此列表保持不同。

例如AAA, AB可以简写为AA, AB。 (但不是 A, AB——因为 A 可能是 AAAAB 的前缀)。 AAAA, BAAAA 可以缩短为 A, B。 但是A1, A2根本不能简写

这里是示例数据

create table tab as 
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;

预期结果是

COL    ABR_COL                
------ ------------------------
AAA    AAA                      
AAAAAA AAAA                     
AABA   AAB                      
AB     AB                       
BAAAA  BA                       
BBAA   BB                       
COL1   COL1                     
COL21  COL2        

我管理了一个由四个子查询组成的蛮力解决方案,我不是故意 post,因为我希望存在一个我不想分散注意力的更简单的解决方案。

顺便说一下,r 中有一个类似的函数,叫做 abbreviate,但我正在寻找 SQL 解决方案。欢迎使用其他 RDBMS 的首选 Oracle 解决方案。

这实际上可以使用递归 CTE。我真的没有得到它比三个子查询(加一个查询)短,但至少它不受字符串长度的限制。步骤大致如下:

  1. 使用递归 CTE 计算所有可能的缩写。这将选择所有列 自己命名,然后递归地将列名缩短一个字母:

Table:

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...
  1. 对于每个缩写,计算它出现的频率

Table

ABBR    CONFLICT
----    --------
AA      3
AAA     2
AABA    1
...
  1. Select唯一最短的缩写,还有 只是列名本身的缩写,并按缩写的长度对它们进行排名。在示例中,您看到 AAA 与其他一些缩写冲突,但仍必须选择它,因为它等于其未缩短的名称。

Table

COL     ABBR    CONFLICT    POS
-------------------------------
AAA     AAA     2           1
AAAAAA  AAAA    1           1
AAAAAA  AAAAA   1           2
AAAAAA  AAAAAA  1           3
AABA    AAB     1           1
...
  1. 为每列选择排名第一的缩写(或列名本身)。

Table

COL     ABBR    POS
-------------------
AAA     AAA     1
AAAAAA  AAAA    1
AABA    AAB     1
...

完成SQL

这导致以下 SQL,上述步骤作为 CTE:

with potential_abbreviations(col,abbr) as (
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
, abbreviation_counts as (
  select abbr
       , count(*) as conflict
  from potential_abbreviations
  group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
    p.col
  , p.abbr
  , conflict
  , rank() over (partition by col order by p.abbr) as pos
  from potential_abbreviations p
    join abbreviation_counts c on p.abbr = c.abbr
    where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
 order by col, abbr

结果

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

我找到了第二种方法,但没有添加到第一个答案中,因为它更短且不同。步骤如下:

  1. 递归计算每个名字的所有可能缩写

SQL

  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations a
  where length(abbr) > 1

结果

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...
  1. 然后计算缩写之间的冲突。还要跟踪导致此缩写的列名。我们只想保留不会导致冲突的缩写,因此 min() 聚合无关紧要。

SQL

select
    abbr
  , count(*) as conflicts
  , min(col) as best_candidate
  from potential_abbreviations
 group by abbr
having count(*) = 1

结果

ABBR    CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA    1         AAAAAA
AAAAA   1         AAAAAA
AAAAAA  1         AAAAAA
AAB     1         AABA
AABA    1         AABA
...
  1. 最后,将可能的缩写与最佳无冲突候选进行左连接,如果没有无冲突解决方案,则只使用列名:

SQL

select
    p.col as col
  , nvl(min(c.abbr), p.col) as abbr
  from potential_abbreviations p
  left join conflict_free c on p.col = c.best_candidate
 where c.conflicts = 1 or p.abbr = p.col
 group by p.col
  order by col, abbr

完成SQL

with potential_abbreviations(col,abbr) as (
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations a
 where length(abbr) > 1
)
, conflict_free as (
    select
        abbr
      , count(*) as conflicts
      , min(col) as best_candidate
      from potential_abbreviations
     group by abbr
    having count(*) = 1
)
select
    p.col as col
  -- , c.best_candidate
  , nvl(min(c.abbr), p.col) as abbr
  -- , min(c.abbr) over (partition by c.best_candidate) shortest
  from potential_abbreviations p
  left join conflict_free c on p.col = c.best_candidate
 where c.conflicts = 1 or p.abbr = p.col
 group by p.col, c.best_candidate
 order by col, abbr

结果

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

注意:对于 Postgresql,递归 CTE 必须是 with recursive 而 Oracle 根本不喜欢 recursive 这个词。

我会在递归 CTE 中进行过滤:

with potential_abbreviations(col, abbr, lev) as (
      select col, col as abbr, 1 as lev
      from tab
      union all
      select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
      from potential_abbreviations pa
      where length(abbr) > 1 and
            not exists (select 1
                        from tab
                        where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                              tab.col <> pa.col
                       )
     )
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
      from potential_abbreviations pa
     ) pa
where seqnum = 1

Here 是一个 db<>fiddle.

lev 完全没有必要。您可以在 order by 中使用 length(abbr) desc。但是,我通常在使用递归 CTE 时包含一个递归计数器,所以这是习惯。

在 CTE 中进行额外的比较可能看起来更复杂,但它简化了执行——递归在正确的值处停止。

这也在唯一的单个字母 col 值上进行了测试。