保持唯一的字符串的缩写
Abbreviation of Strings that Remains Unique
我有一个独特的字符串列表(最初的想法是 table 中的列名)。
任务是执行列表的最大可能缩写,因此列表保持不同。
例如AAA, AB
可以简写为AA, AB
。 (但不是 A, AB
——因为 A
可能是 AAA
和 AB
的前缀)。
AAAA, BAAAA
可以缩短为 A, B
。
但是A1, A2
根本不能简写
这里是示例数据
create table tab as
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;
预期结果是
COL ABR_COL
------ ------------------------
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
我管理了一个由四个子查询组成的蛮力解决方案,我不是故意 post,因为我希望存在一个我不想分散注意力的更简单的解决方案。
顺便说一下,r
中有一个类似的函数,叫做 abbreviate
,但我正在寻找 SQL 解决方案。欢迎使用其他 RDBMS 的首选 Oracle
解决方案。
这实际上可以使用递归 CTE。我真的没有得到它比三个子查询(加一个查询)短,但至少它不受字符串长度的限制。步骤大致如下:
- 使用递归 CTE 计算所有可能的缩写。这将选择所有列
自己命名,然后递归地将列名缩短一个字母:
Table:
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
- 对于每个缩写,计算它出现的频率
Table
ABBR CONFLICT
---- --------
AA 3
AAA 2
AABA 1
...
- Select唯一最短的缩写,还有
只是列名本身的缩写,并按缩写的长度对它们进行排名。在示例中,您看到
AAA
与其他一些缩写冲突,但仍必须选择它,因为它等于其未缩短的名称。
Table
COL ABBR CONFLICT POS
-------------------------------
AAA AAA 2 1
AAAAAA AAAA 1 1
AAAAAA AAAAA 1 2
AAAAAA AAAAAA 1 3
AABA AAB 1 1
...
- 为每列选择排名第一的缩写(或列名本身)。
Table
COL ABBR POS
-------------------
AAA AAA 1
AAAAAA AAAA 1
AABA AAB 1
...
完成SQL
这导致以下 SQL,上述步骤作为 CTE:
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations
where length(abbr) > 1
)
, abbreviation_counts as (
select abbr
, count(*) as conflict
from potential_abbreviations
group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
p.col
, p.abbr
, conflict
, rank() over (partition by col order by p.abbr) as pos
from potential_abbreviations p
join abbreviation_counts c on p.abbr = c.abbr
where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
order by col, abbr
结果
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
我找到了第二种方法,但没有添加到第一个答案中,因为它更短且不同。步骤如下:
- 递归计算每个名字的所有可能缩写
SQL
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
结果
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
- 然后计算缩写之间的冲突。还要跟踪导致此缩写的列名。我们只想保留不会导致冲突的缩写,因此
min()
聚合无关紧要。
SQL
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
结果
ABBR CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA 1 AAAAAA
AAAAA 1 AAAAAA
AAAAAA 1 AAAAAA
AAB 1 AABA
AABA 1 AABA
...
- 最后,将可能的缩写与最佳无冲突候选进行左连接,如果没有无冲突解决方案,则只使用列名:
SQL
select
p.col as col
, nvl(min(c.abbr), p.col) as abbr
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col
order by col, abbr
完成SQL
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
)
, conflict_free as (
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
)
select
p.col as col
-- , c.best_candidate
, nvl(min(c.abbr), p.col) as abbr
-- , min(c.abbr) over (partition by c.best_candidate) shortest
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col, c.best_candidate
order by col, abbr
结果
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
注意:对于 Postgresql,递归 CTE 必须是 with recursive
而 Oracle 根本不喜欢 recursive
这个词。
我会在递归 CTE 中进行过滤:
with potential_abbreviations(col, abbr, lev) as (
select col, col as abbr, 1 as lev
from tab
union all
select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
from potential_abbreviations pa
where length(abbr) > 1 and
not exists (select 1
from tab
where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
tab.col <> pa.col
)
)
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
from potential_abbreviations pa
) pa
where seqnum = 1
Here 是一个 db<>fiddle.
lev
完全没有必要。您可以在 order by
中使用 length(abbr) desc
。但是,我通常在使用递归 CTE 时包含一个递归计数器,所以这是习惯。
在 CTE 中进行额外的比较可能看起来更复杂,但它简化了执行——递归在正确的值处停止。
这也在唯一的单个字母 col
值上进行了测试。
我有一个独特的字符串列表(最初的想法是 table 中的列名)。 任务是执行列表的最大可能缩写,因此列表保持不同。
例如AAA, AB
可以简写为AA, AB
。 (但不是 A, AB
——因为 A
可能是 AAA
和 AB
的前缀)。
AAAA, BAAAA
可以缩短为 A, B
。
但是A1, A2
根本不能简写
这里是示例数据
create table tab as
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;
预期结果是
COL ABR_COL
------ ------------------------
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
我管理了一个由四个子查询组成的蛮力解决方案,我不是故意 post,因为我希望存在一个我不想分散注意力的更简单的解决方案。
顺便说一下,r
中有一个类似的函数,叫做 abbreviate
,但我正在寻找 SQL 解决方案。欢迎使用其他 RDBMS 的首选 Oracle
解决方案。
这实际上可以使用递归 CTE。我真的没有得到它比三个子查询(加一个查询)短,但至少它不受字符串长度的限制。步骤大致如下:
- 使用递归 CTE 计算所有可能的缩写。这将选择所有列 自己命名,然后递归地将列名缩短一个字母:
Table:
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
- 对于每个缩写,计算它出现的频率
Table
ABBR CONFLICT
---- --------
AA 3
AAA 2
AABA 1
...
- Select唯一最短的缩写,还有
只是列名本身的缩写,并按缩写的长度对它们进行排名。在示例中,您看到
AAA
与其他一些缩写冲突,但仍必须选择它,因为它等于其未缩短的名称。
Table
COL ABBR CONFLICT POS
-------------------------------
AAA AAA 2 1
AAAAAA AAAA 1 1
AAAAAA AAAAA 1 2
AAAAAA AAAAAA 1 3
AABA AAB 1 1
...
- 为每列选择排名第一的缩写(或列名本身)。
Table
COL ABBR POS
-------------------
AAA AAA 1
AAAAAA AAAA 1
AABA AAB 1
...
完成SQL
这导致以下 SQL,上述步骤作为 CTE:
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations
where length(abbr) > 1
)
, abbreviation_counts as (
select abbr
, count(*) as conflict
from potential_abbreviations
group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
p.col
, p.abbr
, conflict
, rank() over (partition by col order by p.abbr) as pos
from potential_abbreviations p
join abbreviation_counts c on p.abbr = c.abbr
where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
order by col, abbr
结果
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
我找到了第二种方法,但没有添加到第一个答案中,因为它更短且不同。步骤如下:
- 递归计算每个名字的所有可能缩写
SQL
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
结果
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
- 然后计算缩写之间的冲突。还要跟踪导致此缩写的列名。我们只想保留不会导致冲突的缩写,因此
min()
聚合无关紧要。
SQL
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
结果
ABBR CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA 1 AAAAAA
AAAAA 1 AAAAAA
AAAAAA 1 AAAAAA
AAB 1 AABA
AABA 1 AABA
...
- 最后,将可能的缩写与最佳无冲突候选进行左连接,如果没有无冲突解决方案,则只使用列名:
SQL
select
p.col as col
, nvl(min(c.abbr), p.col) as abbr
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col
order by col, abbr
完成SQL
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
)
, conflict_free as (
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
)
select
p.col as col
-- , c.best_candidate
, nvl(min(c.abbr), p.col) as abbr
-- , min(c.abbr) over (partition by c.best_candidate) shortest
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col, c.best_candidate
order by col, abbr
结果
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
注意:对于 Postgresql,递归 CTE 必须是 with recursive
而 Oracle 根本不喜欢 recursive
这个词。
我会在递归 CTE 中进行过滤:
with potential_abbreviations(col, abbr, lev) as (
select col, col as abbr, 1 as lev
from tab
union all
select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
from potential_abbreviations pa
where length(abbr) > 1 and
not exists (select 1
from tab
where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
tab.col <> pa.col
)
)
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
from potential_abbreviations pa
) pa
where seqnum = 1
Here 是一个 db<>fiddle.
lev
完全没有必要。您可以在 order by
中使用 length(abbr) desc
。但是,我通常在使用递归 CTE 时包含一个递归计数器,所以这是习惯。
在 CTE 中进行额外的比较可能看起来更复杂,但它简化了执行——递归在正确的值处停止。
这也在唯一的单个字母 col
值上进行了测试。