Replace consecutive duplicate letters with a single letter

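How can I create a function in SQL Server 2017 that detects when a string contains consecutive repeated letters (a-z) and replaces each run of repeated letters with a single instance of that letter?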

Here are some examples of what should happen:

CompanyAAABCD -> CompanyABCD
CommpanyABYTTT -> CompanyABYT
Company11111 -> Company11111

create or alter function dbo.fn_RemoveDuplicateChar(@name varchar(200))
returns varchar(200)
as
begin
    declare @strPosition int = 1;
    declare @strlen int = len(@name);
    declare @finalstr varchar(200) = '';
    declare @prev char(1) = null;   -- previous input character (null before the first one)
    declare @cur char(1);           -- character at the current position

    while @strPosition <= @strlen
    begin
        set @cur = substring(@name, @strPosition, 1);
        -- keep the character if it differs from the previous one, or if both are digits
        -- (LIKE '[0-9]' is used because ISNUMERIC also accepts '+', '-', '$', '.' and ',')
        if @prev is null or @cur <> @prev or (@cur like '[0-9]' and @prev like '[0-9]')
            set @finalstr = @finalstr + @cur;
        set @prev = @cur;
        set @strPosition = @strPosition + 1;
    end
    return @finalstr;
end
go
select dbo.fn_RemoveDuplicateChar('CompanyAAABCD')
select dbo.fn_RemoveDuplicateChar('CommpanyABYTTT')
select dbo.fn_RemoveDuplicateChar('Company11111')

If you only want a single round of replacement (i.e. aaabbbb becomes aabb), you can use this:

CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (@value varchar(200))
    RETURNS VARCHAR(200)
    WITH SCHEMABINDING
    AS
    BEGIN    
        DECLARE @result varchar(200) = @value;
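        DECLARE @i int = 65;
        -- A-Z is ASCII 65-90 and a-z is 97-122 (uppercase code + 32).
        -- The binary collation keeps REPLACE case-sensitive, so 'mm' is not turned into 'M'.
        WHILE @i <= 90
        BEGIN
            SET @result = REPLACE(@result COLLATE Latin1_General_BIN, CHAR(@i) + CHAR(@i), CHAR(@i));
            SET @result = REPLACE(@result COLLATE Latin1_General_BIN, CHAR(@i + 32) + CHAR(@i + 32), CHAR(@i + 32));
            SET @i += 1;
        END;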
        RETURN @result;
    END;

GO

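However, it looks like you need a repeated replacement, so that every character that is the same as the one before it is removed.

So we can use this version, which is similar to the other answer.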


CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (@value varchar(200))
RETURNS varchar(200)
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @c char(1);
    DECLARE @cLast char(1) = LEFT(@value, 1);
    DECLARE @result varchar(200) = @cLast;
    DECLARE @strlen int = LEN(@value);
    
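    DECLARE @i int = 2;
    WHILE (@i <= @strlen)   -- <= so that the last character is processed too
    BEGIN
        SET @c = SUBSTRING(@value, @i, 1);
        IF (@c <> @cLast)
            SET @result += @c;

        SET @cLast = @c;    -- remember this character for the next comparison
        SET @i += 1;
    END;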

    RETURN @result;
END;

GO

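I rewrote it as an inline table-valued function and found it to be noticeably faster. Here are two versions, depending on whether you can use STRING_AGG (available from SQL Server 2017).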

CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesXML (@value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
            L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
        Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
       Chars AS (SELECT TOP(LEN(@value)) rn FROM Nums)
    SELECT (
        SELECT SUBSTRING(@value, rn, 1)
        FROM Chars
        WHERE rn = 1 OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)
        ORDER BY rn
        FOR XML PATH(''), TYPE
    ).value('text()[1]','nvarchar(max)') Result
);
        
GO
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesAGG (@value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
            L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
        Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
       Chars AS (SELECT TOP(LEN(@value)) rn FROM Nums)
    SELECT STRING_AGG(SUBSTRING(@value, rn, 1), '') WITHIN GROUP (ORDER BY rn) Result
    FROM Chars
    WHERE rn = 1 OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)
);
        
GO

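This uses Itzik Ben-Gan's famous inline tally-table method to split the string into individual characters. L1 has 16 rows and L2 cross joins it with itself to give 256, so if your strings are longer than 256 characters you will need another CROSS JOIN or more (1) values (and a wider parameter than varchar(200)).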
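
As a rough sketch of what "another CROSS JOIN" could look like, the variant below adds an L3 level (256 x 256 = 65,536 rows) and widens the parameter to varchar(8000). The function name RemoveDuplicatesAGG8K and the L3 CTE are assumptions made for illustration; they are not part of the original answer. Everything else mirrors RemoveDuplicatesAGG above.

```sql
-- Hypothetical variant of RemoveDuplicatesAGG for strings of up to 8000 characters.
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesAGG8K (@value varchar(8000))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 16 rows
         L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),   -- 16 * 16   = 256 rows
         L3 AS (SELECT 1 n FROM L2 A CROSS JOIN L2 B),   -- 256 * 256 = 65,536 rows (assumed extra level)
       Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L3),
      Chars AS (SELECT TOP(LEN(@value)) rn FROM Nums)    -- one row per character position
    SELECT STRING_AGG(SUBSTRING(@value, rn, 1), '') WITHIN GROUP (ORDER BY rn) Result
    FROM Chars
    -- keep the first character and every character that differs from its predecessor
    WHERE rn = 1 OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)
);

GO
```

Because TOP(LEN(@value)) limits the rows that are actually produced, the larger tally should cost little for short strings.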

There are two ways to use it; performance should be the same.

As a scalar subquery:

SELECT (SELECT d.Result FROM dbo.RemoveDuplicatesAGG(t.MyString) d) AS Result
FROM myTable t

Or with APPLY:

SELECT d.Result
FROM myTable t
CROSS APPLY RemoveDuplicatesAGG(t.MyString) d
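
Note that both inline functions above collapse runs of any character, so 'Company11111' would come back as 'Company1', whereas the question wants only letters de-duplicated. A minimal sketch of a letters-only variant follows. The name RemoveDuplicateLetters and the extra NOT LIKE '[A-Za-z]' predicate are assumptions added for illustration, not part of the original answer.

```sql
-- Hypothetical letters-only variant: runs of non-letters (digits, punctuation) are left alone.
CREATE OR ALTER FUNCTION dbo.RemoveDuplicateLetters (@value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
         L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
       Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
      Chars AS (SELECT TOP(LEN(@value)) rn FROM Nums)
    SELECT STRING_AGG(SUBSTRING(@value, rn, 1), '') WITHIN GROUP (ORDER BY rn) Result
    FROM Chars
    WHERE rn = 1                                                       -- always keep the first character
       OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)     -- keep if different from the previous one
       OR SUBSTRING(@value, rn, 1) NOT LIKE '[A-Za-z]'                 -- keep every non-letter, even when repeated
);

GO

-- Usage against the sample strings from the question
SELECT t.MyString, d.Result
FROM (VALUES ('CompanyAAABCD'), ('CommpanyABYTTT'), ('Company11111')) AS t(MyString)
CROSS APPLY dbo.RemoveDuplicateLetters(t.MyString) AS d;
```

With the question's sample strings this should return CompanyABCD, CompanyABYT and Company11111. Whether 'Aa' counts as a duplicate depends on the collation: under a case-insensitive collation the <> comparison treats the two as equal.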

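I know I'm a little late here, but if performance matters you can use the fastest "de-duplicator" in the game (the function, removeDupesExcept8K, is at the end of this post). It takes an input string and a pattern that describes the characters whose duplicates should be preserved. In the example below, the pattern '[^A-Z]' means "preserve duplicates of anything that is not between A and Z", so only repeated letters are collapsed.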

DECLARE @string VARCHAR(8000) = 'AAABBBCCC999';

SELECT rd.NewString FROM samd.removeDupesExcept8K(@string, '[^A-Z]') AS rd;

Returns: ABC999

Let's compare fn_RemoveDuplicateChar from B.Muthamizhselvi's answer above with removeDupesExcept8K from the end of this post.

Performance test:

--==== Test Data
SELECT TOP(10000)
  ID     = IDENTITY(INT,1,1),
  String = REPLACE(REPLACE(REPLACE(NEWID(),'A',0),'B',0),'-','AAA')
INTO #strings
FROM sys.all_columns, sys.all_columns b;
GO

--==== Performance Test
PRINT CHAR(13)+'dbo.fn_RemoveDuplicateChar'+CHAR(13)+REPLICATE('-',90);
GO
  DECLARE @st DATETIME = GETDATE(), @x VARCHAR(100);
  
  SELECT @x = dbo.fn_RemoveDuplicateChar(s.String)
  FROM   #strings AS s

PRINT DATEDIFF(MS,@st,GETDATE());
GO 3

PRINT CHAR(13)+'samd.removeDupChar8K - Serial'+CHAR(13)+REPLICATE('-',90);
GO
  DECLARE @st DATETIME = GETDATE(), @x VARCHAR(100);
  
  SELECT @x = rd.NewString
  FROM   #strings AS s
  CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
  OPTION (MAXDOP 1);

PRINT DATEDIFF(MS,@st,GETDATE());
GO 3

PRINT CHAR(13)+'samd.removeDupChar8K - Parallel'+CHAR(13)+REPLICATE('-',90);
GO
  DECLARE @st DATETIME = GETDATE(), @x VARCHAR(100);
  
  SELECT @x = rd.NewString
  FROM   #strings AS s
  CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
  OPTION (QUERYTRACEON 8649);
  PRINT DATEDIFF(MS,@st,GETDATE());
GO 3

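As shown below, removeDupesExcept8K is about twice as fast with a serial execution plan (one CPU) and more than 10 times as fast with a parallel plan. There is no need to test fn_RemoveDuplicateChar with a parallel plan: unless they are inlined, scalar UDFs cannot run in parallel.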

Test results:

dbo.fn_RemoveDuplicateChar
------------------------------------------------------------------------------------------
Beginning execution loop
1110
1106
1093
Batch execution completed 3 times.

samd.removeDupChar8K - Serial
------------------------------------------------------------------------------------------
Beginning execution loop
563
560
593
Batch execution completed 3 times.

samd.removeDupChar8K - Parallel
------------------------------------------------------------------------------------------
Beginning execution loop
91
91
93
Batch execution completed 3 times.

The function:

IF OBJECT_ID('samd.removeDupesExcept8K') IS NOT NULL DROP FUNCTION samd.removeDupesExcept8K;
GO
CREATE FUNCTION samd.removeDupesExcept8K(@string varchar(8000), @preserved varchar(50))
/*****************************************************************************************
[Purpose]:
 A purely set-based inline table valued function (iTVF) that accepts an input string
 (@string) and a pattern (@preserved) and removes all duplicate characters in @string that
 do not match the @preserved pattern.

[Author]:
 Alan Burstein

[Compatibility]:
 SQL Server 2008+

[Syntax]:
--===== Autonomous use
 SELECT rd.newString
 FROM   samd.removeDupesExcept8K(@string, @preserved) AS rd;

--===== Use against a table
 SELECT st.SomeColumn1, rd.newString
 FROM   SomeTable AS st
 CROSS 
 APPLY  samd.removeDupesExcept8K(st.SomeColumn1, @preserved) AS rd;

Parameters:
 @string    = varchar(8000); Input string to be "cleaned"
 @preserved = varchar(50); the pattern to preserve. For example, when @preserved='[0-9]'
              only duplicates of non-numeric characters will be removed

[Return Types]:
 Inline Table Valued Function returns:
 newString = varchar(8000); the string with duplicate characters removed

[Developer Notes]:
 1. Requires NGrams8K. The code for NGrams8K can be found here:
    http://www.sqlservercentral.com/articles/Tally+Table/142316/

 2. This function is what is referred to as an "inline" scalar UDF. Technically it's an
    inline table valued function (iTVF) but performs the same task as a scalar valued user
    defined function (UDF); the difference is that it requires the APPLY table operator
    to accept column values as a parameter. For more about "inline" scalar UDFs see this
    article by SQL MVP Jeff Moden: http://www.sqlservercentral.com/articles/T-SQL/91724/
    and for more about how to use APPLY see this article by SQL MVP Paul White:
    http://www.sqlservercentral.com/articles/APPLY/69953/.

    Note the above syntax example and usage examples below to better understand how to
    use the function. Although the function is slightly more complicated to use than a
    scalar UDF it will yield notably better performance for many reasons. For example,
    unlike scalar UDFs or multi-line table valued functions, the inline scalar UDF does
    not restrict the query optimizer's ability to generate a parallel query execution plan.

 3. removeDupesExcept8K is deterministic; for more about deterministic and nondeterministic
    functions see https://msdn.microsoft.com/en-us/library/ms178091.aspx

[Examples]:
--===== 1. Examples...
 DECLARE @string varchar(8000) = '!!!aa###bb!!!';
 BEGIN
   --===== 1.1. Remove all duplicate characters
     SELECT f.newString 
     FROM samd.removeDupesExcept8K(@string,'') f; -- Returns: !a#b!
   
   --===== 1.2. Remove all non-alphabetical duplicates
     SELECT f.newString
     FROM samd.removeDupesExcept8K(@string,'[a-z]') f; -- Returns: !aa#bb!
   
   --===== 1.3. Remove only alphabetical duplicates
     SELECT f.newString
     FROM samd.removeDupesExcept8K(@string,'[^a-z]') f; -- Returns: !!!a###b!!!
 END
---------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - 20160720 - Initial Creation - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
( 
  SELECT   ng.token+''
  FROM     samd.NGrams8K(@string,1) AS ng
  WHERE    ng.token <> SUBSTRING(@string, ng.position+1,1) -- exclude chars = the next char
  OR       ng.token LIKE @preserved -- preserve characters that match the @preserved pattern
  ORDER BY ng.position
  FOR XML PATH(''),TYPE
).value('(text())[1]','varchar(8000)'); -- using Wayne Sheffield’s concatenation logic