Replace consecutive duplicate letters with a single letter
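How can I create a function in SQL Server 2017 that detects when a string contains consecutive repeated letters (a-z) and replaces each run of repeats with a single instance of that letter?
Here are some examples of what should happen: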
CompanyAAABCD -> CompanyABCD
CommpanyABYTTT -> CompanyABYT
Company11111 -> Company11111
CREATE OR ALTER FUNCTION dbo.fn_RemoveDuplicateChar (@name varchar(200))
RETURNS varchar(200)
AS
BEGIN
    DECLARE @strPosition int = 1;
    DECLARE @strlen int = LEN(@name);
    DECLARE @finalstr varchar(200) = '';
    DECLARE @str varchar(200) = '';   -- last character already written to @finalstr
    DECLARE @fstr varchar(200) = '';  -- current character of @name

    WHILE @strPosition <= @strlen
    BEGIN
        SET @fstr = SUBSTRING(@name, @strPosition, 1);
        SET @str = SUBSTRING(@finalstr, LEN(@finalstr), 1);
        -- Keep the character if it differs from the previous one,
        -- or if both are numeric (so repeated digits are preserved).
        IF @fstr <> @str OR (ISNUMERIC(@fstr) = 1 AND ISNUMERIC(@str) = 1)
            SET @finalstr = @finalstr + @fstr;
        SET @strPosition = @strPosition + 1;
    END
    RETURN @finalstr;
END
GO
SELECT dbo.fn_RemoveDuplicateChar('CompanyAAABCD');
SELECT dbo.fn_RemoveDuplicateChar('CommpanyABYTTT');
SELECT dbo.fn_RemoveDuplicateChar('Company11111');
If you only want a single round of replacement (i.e. aaabbbb becomes aabb), then you can use this:
CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (@value varchar(200))
RETURNS varchar(200)
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @result varchar(200) = @value;
    DECLARE @i int = 65;
    -- A-Z is ASCII 65-90; under a case-insensitive collation REPLACE also matches a-z
    -- (a matched lowercase pair is then replaced by the uppercase letter)
    WHILE @i <= 90
    BEGIN
        SET @result = REPLACE(@result, CHAR(@i) + CHAR(@i), CHAR(@i));
        SET @i += 1;
    END;
    RETURN @result;
END;
GO
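To see why one round of REPLACE is not enough, here is a minimal sketch that is not part of the original answer. The variable @s and the hard-coded letters a and b are purely illustrative. Each REPLACE call only halves a run of repeats, so the call has to be repeated until no pair is left.

```sql
DECLARE @s varchar(200) = 'aaabbbb';

-- One pass: every non-overlapping pair collapses to one letter, so runs are only halved.
SELECT REPLACE(REPLACE(@s, 'aa', 'a'), 'bb', 'b') AS OnePass;   -- aabb

-- Repeat the pass until no pair remains.
WHILE @s LIKE '%aa%' OR @s LIKE '%bb%'
    SET @s = REPLACE(REPLACE(@s, 'aa', 'a'), 'bb', 'b');

SELECT @s AS UntilStable;                                        -- ab
```

Looping like this works, but it rescans the whole string on every iteration, which is one reason to compare each character with its predecessor instead.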
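However, it looks like you need a recursive replacement, so that every character preceded by the same character is removed.
So we can use this version, which is similar to the other answer.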
CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (@value varchar(200))
RETURNS varchar(200)
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @c char(1);
    DECLARE @cLast char(1) = LEFT(@value, 1);
    DECLARE @result varchar(200) = @cLast;
    DECLARE @strlen int = LEN(@value);
    DECLARE @i int = 2;
    WHILE (@i <= @strlen)  -- <= so that the last character is processed too
    BEGIN
        SET @c = SUBSTRING(@value, @i, 1);
        IF (@c <> @cLast)
            SET @result += @c;
        SET @cLast = @c;  -- remember this character for the next comparison
        SET @i += 1;
    END;
    RETURN @result;
END;
GO
I rewrote it as an inline table-valued function and found it to be noticeably faster. Here are two versions, depending on whether you can use STRING_AGG:
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesXML (@value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
         L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
         Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
         Chars AS (SELECT TOP(LEN(@value)) rn FROM Nums)
    SELECT (
        SELECT SUBSTRING(@value, rn, 1)
        FROM Chars
        WHERE rn = 1 OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)
        ORDER BY rn
        FOR XML PATH(''), TYPE
    ).value('text()[1]','nvarchar(max)') Result
);
GO
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesAGG (@value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
         L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
         Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
         Chars AS (SELECT TOP(LEN(@value)) rn FROM Nums)
    SELECT STRING_AGG(SUBSTRING(@value, rn, 1), '') WITHIN GROUP (ORDER BY rn) Result
    FROM Chars
    WHERE rn = 1 OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)
);
GO
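This uses Itzik Ben-Gan's well-known inline tally-table method to split the string into individual characters. The 16 rows in L1, cross joined with themselves, give 256 numbers, so if your strings are longer than 256 characters you will need another CROSS JOIN or more (1) rows in the VALUES list.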
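As an illustration (a sketch, not part of the original answer), the tally CTE can be run on its own to see the positions and characters it produces. The extra CTE L3 is an assumed extension that shows how one more CROSS JOIN raises the limit from 256 to 4,096 rows; the sample value is arbitrary.

```sql
DECLARE @value varchar(200) = 'CommpanyABYTTT';

WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)), -- 16 rows
     L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),   -- 16 * 16 = 256 rows
     L3 AS (SELECT 1 n FROM L2 A CROSS JOIN L1 B),   -- 256 * 16 = 4096 rows (assumed extension)
     Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L3)
-- One row per character position: rn is the position, ch the character found there.
SELECT TOP (LEN(@value)) rn, SUBSTRING(@value, rn, 1) AS ch
FROM Nums
ORDER BY rn;
```

Because the numbers are generated from constants, the query reads no table, and the optimizer only has to produce as many rows as TOP asks for.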
There are two ways to use it; performance should be the same.
As a scalar subquery:
SELECT (SELECT * FROM dbo.RemoveDuplicatesAGG(t.MyString)) Result
FROM myTable t;
Or with APPLY:
SELECT d.Result
FROM myTable t
CROSS APPLY dbo.RemoveDuplicatesAGG(t.MyString) d;
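Note that both inline functions collapse every repeated character, digits included, while the question asks for Company11111 to stay unchanged. The following is a hedged sketch of one way to restrict the filter to letters; the extra NOT LIKE '[a-z]' predicate is my assumption rather than part of the answer, and it relies on a case-insensitive collation so that the range also covers A-Z.

```sql
DECLARE @value varchar(200) = 'Commpany11111';

WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
     L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
     Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
     Chars AS (SELECT TOP (LEN(@value)) rn FROM Nums)
SELECT STRING_AGG(SUBSTRING(@value, rn, 1), '') WITHIN GROUP (ORDER BY rn) AS Result
FROM Chars
WHERE rn = 1
   OR SUBSTRING(@value, rn - 1, 1) <> SUBSTRING(@value, rn, 1)  -- differs from the previous character
   OR SUBSTRING(@value, rn, 1) NOT LIKE '[a-z]';                -- assumption: always keep non-letters
```

With a case-insensitive collation the repeated m is collapsed while the run of 1s is kept. Under a case-sensitive or binary collation the character range would need to be adjusted.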
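I know I'm a little late here, but if performance matters you can use the fastest "de-duplicator" in the game (the function, removeDupesExcept8K, is at the end of this post). It takes an input string and a pattern describing the characters whose duplicates should be preserved. In the example below, the pattern '[^A-Z]' preserves everything that is not between A and Z, so only duplicate letters are removed.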
DECLARE @string VARCHAR(8000) = 'AAABBBCCC999';
SELECT rd.NewString FROM samd.removeDupesExcept8K(@string, '[^A-Z]') AS rd;
Returns: ABC999
Let's compare fn_RemoveDuplicateChar from B.Muthamizhselvi's answer above with removeDupesExcept8K from the end of this post.
Performance test:
--==== Test Data
SELECT TOP(10000)
ID = IDENTITY(INT,1,1),
String = REPLACE(REPLACE(REPLACE(NEWID(),'A',0),'B',0),'-','AAA')
INTO #strings
FROM sys.all_columns, sys.all_columns b;
GO
--==== Performance Test
PRINT CHAR(13)+'dbo.fn_RemoveDuplicateChar'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @x VARCHAR(100);
SELECT @x = dbo.fn_RemoveDuplicateChar(s.String)
FROM #strings AS s
PRINT DATEDIFF(MS,@st,GETDATE());
GO 3
PRINT CHAR(13)+'samd.removeDupChar8K - Serial'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @x VARCHAR(100);
SELECT @x = rd.NewString
FROM #strings AS s
CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS,@st,GETDATE());
GO 3
PRINT CHAR(13)+'samd.removeDupChar8K - Parallel'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE @st DATETIME = GETDATE(), @x VARCHAR(100);
SELECT @x = rd.NewString
FROM #strings AS s
CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
OPTION (QUERYTRACEON 8649);
PRINT DATEDIFF(MS,@st,GETDATE());
GO 3
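As shown below, removeDupesExcept8K is about twice as fast with a serial execution plan (one CPU) and more than 10 times as fast with a parallel plan. There is no need to test fn_RemoveDuplicateChar with a parallel plan: scalar UDFs cannot run in parallel unless they are inlined.
Test results: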
dbo.fn_RemoveDuplicateChar
------------------------------------------------------------------------------------------
Beginning execution loop
1110
1106
1093
Batch execution completed 3 times.
samd.removeDupChar8K - Serial
------------------------------------------------------------------------------------------
Beginning execution loop
563
560
593
Batch execution completed 3 times.
samd.removeDupChar8K - Parallel
------------------------------------------------------------------------------------------
Beginning execution loop
91
91
93
Batch execution completed 3 times.
The function:
IF OBJECT_ID('samd.removeDupesExcept8K') IS NOT NULL DROP FUNCTION samd.removeDupesExcept8K;
GO
CREATE FUNCTION samd.removeDupesExcept8K(@string varchar(8000), @preserved varchar(50))
/*****************************************************************************************
[Purpose]:
A purely set-based inline table valued function (iTVF) that accepts an input string
(@string) and a pattern (@preserved) and removes all duplicate characters in @string that
do not match the @preserved pattern.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous use
SELECT rd.newString
FROM samd.removeDupesExcept8K(@string, @preserved) AS rd;
--===== Use against a table
SELECT st.SomeColumn1, rd.newString
FROM SomeTable AS st
CROSS
APPLY samd.removeDupesExcept8K(st.SomeColumn1, @preserved) AS rd;
Parameters:
@string = varchar(8000); Input string to be "cleaned"
@preserved = varchar(50); the pattern to preserve. For example, when @preserved='[0-9]'
only duplicate non-numeric characters will be removed
[Return Types]:
Inline Table Valued Function returns:
newString = varchar(8000); the string with duplicate characters removed
[Developer Notes]:
1. Requires NGrams8K. The code for NGrams8K can be found here:
http://www.sqlservercentral.com/articles/Tally+Table/142316/
2. This function is what is referred to as an "inline" scalar UDF. Technically it's an
inline table valued function (iTVF) but performs the same task as a scalar valued user
defined function (UDF); the difference is that it requires the APPLY table operator
to accept column values as a parameter. For more about "inline" scalar UDFs see this
article by SQL MVP Jeff Moden: http://www.sqlservercentral.com/articles/T-SQL/91724/
and for more about how to use APPLY see this article by SQL MVP Paul White:
http://www.sqlservercentral.com/articles/APPLY/69953/.
Note the above syntax example and usage examples below to better understand how to
use the function. Although the function is slightly more complicated to use than a
scalar UDF it will yield notably better performance for many reasons. For example,
unlike scalar UDFs or multi-line table valued functions, the inline scalar UDF does
not restrict the query optimizer's ability to generate a parallel query execution plan.
3. removeDupesExcept8K is deterministic; for more about deterministic and nondeterministic
functions see https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Examples...
DECLARE @string varchar(8000) = '!!!aa###bb!!!';
BEGIN
--===== 1.1. Remove all duplicate characters
SELECT f.newString
FROM samd.removeDupesExcept8K(@string,'') f; -- Returns: !a#b!
--===== 1.2. Remove all non-alphabetical duplicates
SELECT f.newString
FROM samd.removeDupesExcept8K(@string,'[a-z]') f; -- Returns: !aa#bb!
--===== 1.3. Remove only alphabetical duplicates
SELECT f.newString
FROM samd.removeDupesExcept8K(@string,'[^a-z]') f; -- Returns: !!!a###b!!!
END
---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20160720 - Initial Creation - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
    SELECT ng.token+''
    FROM samd.NGrams8K(@string,1) AS ng
    WHERE ng.token <> SUBSTRING(@string, ng.position+1,1) -- exclude chars = the next char
    OR ng.token LIKE @preserved -- preserve characters that match the @preserved pattern
    ORDER BY ng.position
    FOR XML PATH(''),TYPE
).value('(text())[1]','varchar(8000)'); -- using Wayne Sheffield’s concatenation logic