SQL Server: generate data using a regex pattern
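I want to generate data in SQL Server from a given regex pattern. Is that possible?
The idea behind this is SQL Static Data Masking (which has been removed in the current version). Our client wants production data masked in the test database. We no longer have the Static Data Masking feature in SQL Server, but we do have patterns for masking the columns, so my thinking is that we could use these patterns to run an update query.
Say I have the patterns below and want to generate data like this: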
SELECT "(\d){7}" AS RandomNumber, "(\W){5}" AS RandomString FROM tbl
The output should be:
+--------------+--------------+
| RandomNumber | RandomString |
+--------------+--------------+
| 7894562      | AHJIL        |
+--------------+--------------+
| 9632587      | ZLOKP        |
+--------------+--------------+
| 4561238      | UJIOK        |
+--------------+--------------+
Besides these general patterns, I also have some custom patterns such as Test_Product_(\d){1,4}, whose results should look like this:
Test_Product_012
Test_Product_143
Test_Product_8936
The full set of patterns I will use for masking:
Pattern                       Sample
(\l){30}                      ahukoklijfahukokponmahukoahuko
(\d){7}                       7895623
(\W){5}                       ABCDEF
Test_Product_(\d){1,4}        Test_Product_007
0\.(\d){2}                    0.59
https://www\.(\l){10}\.com    https://www.anything.com
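I don't think you need regex for this. Why not just use a "scrub script" and take advantage of the newid() function to generate a bunch of random data? It looks like you would have to write such a script anyway, with or without regex, and this approach has the advantage of being very simple.
Suppose you start with the following data: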
create table tbl (PersonalId int, Name varchar(max))
insert into tbl select 300300, 'Michael'
insert into tbl select 554455, 'Tim'
insert into tbl select 228899, 'John'
select * from tbl
Then run your script:
update tbl set PersonalId = cast(rand(checksum(newid())) * 1000000 as int)
update tbl set Name = left(convert(varchar(255), newid()), 6)
select * from tbl
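If the scrubbed values need to follow the shapes in the question more closely, the same newid() idea can be stretched a little. The query below is only a sketch: the three-row VALUES source, the column names, and the zero-padding are my own assumptions, standing in for whatever table is being masked. It produces a zero-padded 7-digit string (roughly (\d){7}) and five upper-case letters (roughly (\W){5} in the question's notation).

```sql
-- Sketch: pattern-shaped random values built from NEWID().
-- Each NEWID() call is evaluated separately for every row.
-- The modulo is taken before ABS so that ABS can never overflow.
SELECT
    Demo.RowId,
    -- seven digits, left-padded with zeros
    RIGHT('0000000' + CAST(ABS(CHECKSUM(NEWID()) % 10000000) AS varchar(7)), 7) AS SevenDigits,
    -- five upper-case letters, one NEWID() per character
    CHAR(65 + ABS(CHECKSUM(NEWID()) % 26))
  + CHAR(65 + ABS(CHECKSUM(NEWID()) % 26))
  + CHAR(65 + ABS(CHECKSUM(NEWID()) % 26))
  + CHAR(65 + ABS(CHECKSUM(NEWID()) % 26))
  + CHAR(65 + ABS(CHECKSUM(NEWID()) % 26)) AS FiveUpper
FROM (VALUES (1), (2), (3)) AS Demo(RowId);
```

The same expressions can be dropped into the SET clause of an update in place of the ones above.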
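Well, I can give you a solution that is based not on regular expressions but on a set of parameters, and it covers the full set of your requirements.
I based this solution on a user-defined function I wrote for generating random strings (I wrote a blog post about it). I just changed it so that it can generate the masks you want, under the following rules:
- The mask has an optional prefix.
- The mask has an optional suffix.
- The mask has a random string of variable length.
- The random string can contain lower-case letters, upper-case letters, digits, or any combination of these.
I decided on this set of rules based on the update to your question, which contains the masks you want: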
(\d){7} 7895623
(\W){5} ABCDEF
Test_Product_(\d){1,4} Test_Product_007
0\.(\d){2} 0.59
https://www\.(\l){10}\.com https://www.anything.com
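Read against the rules above, each of the question's masks reduces to a prefix, a suffix, a length range, and a character class. The query below only lays out that mapping as data; the column names are my own, and the character-type codes (1 = lower-case, 2 = upper-case, 4 = digits) are the ones used by the function further down.

```sql
-- Illustrative only: every mask expressed as
-- (prefix, suffix, minimum length, maximum length, character type).
SELECT Pattern, Prefix, Suffix, MinLength, MaxLength, CharType
FROM (VALUES
    ('(\l){30}',                   NULL,            NULL,   30, 30, 1),
    ('(\d){7}',                    NULL,            NULL,    7,  7, 4),
    ('(\W){5}',                    NULL,            NULL,    5,  5, 2),
    ('Test_Product_(\d){1,4}',     'Test_Product_', NULL,    1,  4, 4),
    ('0\.(\d){2}',                 '0.',            NULL,    2,  2, 4),
    ('https://www\.(\l){10}\.com', 'https://www.',  '.com', 10, 10, 1)
) AS Masks(Pattern, Prefix, Suffix, MinLength, MaxLength, CharType);
```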
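Now, for the code:
Since I'm using a user-defined function, I can't call the built-in NewId() function inside it, so we first need to create a view that generates the guid for us: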
CREATE VIEW GuidGenerator
AS
SELECT Newid() As NewGuid;
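Inside the function, we will use this view to generate a NewID() as the basis for all the randomness.
The function itself is a lot more cumbersome than the random string generator I started with: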
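To make "the basis for all the randomness" concrete: a guid can be turned into an integer with CHECKSUM, and that integer into a character with the modulo operator. This is a minimal sketch run outside any function, so NEWID() can be called directly; the variable name @Rnd and the cap of 1,000,000 are my own choices.

```sql
-- Sketch: one random integer from a guid, mapped to three kinds of characters.
DECLARE @Rnd int = ABS(CHECKSUM(NEWID()) % 1000000);

SELECT @Rnd                  AS Rnd,
       CHAR(97 + @Rnd % 26)  AS LowerChar,  -- 97 is 'a', so this is 'a'..'z'
       CHAR(65 + @Rnd % 26)  AS UpperChar,  -- 65 is 'A', so this is 'A'..'Z'
       CHAR(48 + @Rnd % 10)  AS DigitChar;  -- 48 is '0', so this is '0'..'9'
```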
CREATE FUNCTION dbo.MaskGenerator
(
    -- use null or an empty string for no prefix
    @Prefix nvarchar(4000),
    -- use null or an empty string for no suffix
    @Suffix nvarchar(4000),
    -- the minimum length of the random part
    @MinLength int,
    -- the maximum length of the random part
    @MaxLength int,
    -- the maximum number of rows to return. Note: up to 1,000,000 rows
    @Count int,
    -- 1, 2 and 4 stand for lower-case, upper-case and digits.
    -- a bitwise combination of these values can be used to generate all possible combinations:
    -- 3: lower and upper, 5: lower and digits, 6: upper and digits, 7: lower, upper and digits
    @CharType tinyint
)
RETURNS TABLE
AS
RETURN
    -- An inline tally table with 1,000,000 rows
    WITH E1(N) AS (SELECT N FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) V(N)), -- 10
         E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 100
         E3(N) AS (SELECT 1 FROM E2 a, E2 b), -- 10,000
         Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) FROM E3 a, E2 b) -- 1,000,000
    SELECT TOP(@Count)
        N As Number,
        CONCAT(@Prefix, (
            SELECT TOP (Length)
                -- choose what char combination to use for the random part
                CASE @CharType
                    WHEN 1 THEN Lower
                    WHEN 2 THEN Upper
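                    -- (Rnd / 130) % 2 does not depend on Rnd % 26 or Rnd % 10 (130 is their
                    -- least common multiple), so the coin flip does not skew which character is picked
                    WHEN 3 THEN IIF((Rnd / 130) % 2 = 0, Lower, Upper)
                    WHEN 4 THEN Digit
                    WHEN 5 THEN IIF((Rnd / 130) % 2 = 0, Lower, Digit)
                    WHEN 6 THEN IIF((Rnd / 130) % 2 = 0, Upper, Digit)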
                    WHEN 7 THEN
                        CASE Rnd % 3
                            WHEN 0 THEN Lower
                            WHEN 1 THEN Upper
                            ELSE Digit
                        END
                END
            FROM Tally As t0
            -- create a random number from the guid using the GuidGenerator view
            CROSS APPLY (SELECT Abs(Checksum(NewGuid)) As Rnd FROM GuidGenerator) As rand
            CROSS APPLY
            (
                -- generate a random lower-case char, upper-case char and digit
                SELECT CHAR(97 + Rnd % 26) As Lower, -- Random lower case letter
                       CHAR(65 + Rnd % 26) As Upper, -- Random upper case letter
                       CHAR(48 + Rnd % 10) As Digit  -- Random digit
            ) As Chars
            WHERE t0.N <> -t1.N -- Needed for the subquery to get re-evaluated for each row
            FOR XML PATH('')
        ), @Suffix) As RandomString
    FROM Tally As t1
    CROSS APPLY
    (
        -- Select a random length between @MinLength and @MaxLength (inclusive)
        SELECT TOP 1 N As Length
        FROM Tally As t2
        CROSS JOIN GuidGenerator
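        WHERE t2.N >= @MinLength
          AND t2.N <= @MaxLength
          -- always true (tally values are positive), but it references t1 so the
          -- subquery is re-evaluated for each row without excluding any length
          AND t2.N <> -t1.N
        ORDER BY NewGuid
    ) As Lengths;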
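For anyone unfamiliar with the inline tally table the function starts with: it is just a small seed of rows cross joined with itself and then numbered. Here is a cut-down sketch of the same idea that stops at 100 rows; ordering by (SELECT NULL) instead of @@SPID is my own substitution.

```sql
-- Sketch: a 100-row inline tally (numbers) table.
WITH E1(N) AS (SELECT N FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) V(N)), -- 10 rows
     E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),                                            -- 10 x 10 = 100 rows
     Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E2)                   -- numbered 1..100
SELECT MIN(N) AS FirstN, MAX(N) AS LastN, COUNT(*) AS RowsInTally
FROM Tally;
-- returns 1, 100, 100
```

Each further cross join multiplies the row count, which is how the function reaches 1,000,000 rows without reading any real table.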
Finally, the test cases:

(\l){30} - ahukoklijfahukokponmahukoahuko
SELECT RandomString FROM dbo.MaskGenerator(null, null, 30, 30, 2, 1);
Results:
1, eyrutkzdugogyhxutcmcmplvzofser
2, juuyvtzsvmmcdkngnzipvsepviepsp

(\d){7} - 7895623
SELECT RandomString FROM dbo.MaskGenerator(null, null, 7, 7, 2, 4);
Results:
1, 8744412
2, 2275313

(\W){5} - ABCDE
SELECT RandomString FROM dbo.MaskGenerator(null, null, 5, 5, 2, 2);
Results:
1, RSYJE
2, MMFAA

Test_Product_(\d){1,4} - Test_Product_007
SELECT RandomString FROM dbo.MaskGenerator('Test_Product_', null, 1, 4, 2, 4);
Results:
1, Test_Product_933
2, Test_Product_7

0\.(\d){2} - 0.59
SELECT RandomString FROM dbo.MaskGenerator('0.', null, 2, 2, 2, 4);
Results:
1, 0.68
2, 0.70

https://www\.(\l){10}\.com - https://www.anything.com
SELECT RandomString FROM dbo.MaskGenerator('https://www.', '.com', 10, 10, 2, 1);
Results:
1, https://www.xayvkmkuci.com
2, https://www.asbfcvomax.com
Here is how you can use it to mask the contents of a table:
DECLARE @Count int = 10;
SELECT CAST(IntVal.RandomString As Int) As IntColumn,
       UpVal.RandomString As UpperCaseValue,
       LowVal.RandomString As LowerCaseValue,
       MixVal.RandomString As MixedValue,
       WithPrefix.RandomString As PrefixedValue
FROM dbo.MaskGenerator(null, null, 3, 7, @Count, 4) As IntVal
JOIN dbo.MaskGenerator(null, null, 10, 10, @Count, 1) As LowVal
    ON IntVal.Number = LowVal.Number
JOIN dbo.MaskGenerator(null, null, 5, 10, @Count, 2) As UpVal
    ON IntVal.Number = UpVal.Number
JOIN dbo.MaskGenerator(null, null, 10, 20, @Count, 7) As MixVal
    ON IntVal.Number = MixVal.Number
JOIN dbo.MaskGenerator('Test ', null, 1, 4, @Count, 4) As WithPrefix
    ON IntVal.Number = WithPrefix.Number;
Results:
IntColumn  UpperCaseValue  LowerCaseValue  MixedValue            PrefixedValue
674        CCNVSDI         esjyyesesv      O2FAC7bfwg2Be5a91Q0   Test 4935
30732      UJKSL           jktisddbnq      7o8B91Sg1qrIZSvG3AcL  Test 0
4669472    HDLJNBWPJ       qgtfkjdyku      xUoLAZ4pAnpn          Test 8
26347      DNAKERR         vlehbnampb      NBv08yJdKb75ybhaFqED  Test 91
6084965    LJPMZMEU        ccigzyfwnf      MPxQ2t8jjmv0IT45yVcR  Test 4
6619851    FEHKGHTUW       wswuefehsp      40n7Ttg7H5YtVPF       Test 848
781        LRWKVDUV        bywoxqizju      UxIp2O4Jb82Ts         Test 6268
52237      XXNPBL          beqxrgstdo      Uf9j7tCB4W2           Test 43
876150     ZDRABW          fvvinypvqa      uo8zfRx07s6d0EP       Test 7
Note that this is a fast process: in the tests I ran, generating 1,000 rows with 5 columns took less than half a second on average.
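Since the question asks for an update query, here is one way the generator could be joined to an existing table. Treat it as a sketch rather than a tested recipe: it assumes the tbl table from the first answer and the dbo.MaskGenerator function above, and the CTE names ToMask and Masks and the RowNo column are my own. Both sides are numbered with ROW_NUMBER so that the match does not depend on the generator's Number column being gap-free.

```sql
-- Sketch only: overwrite tbl.Name with five random upper-case letters per row.
DECLARE @Rows int = (SELECT COUNT(*) FROM tbl);

WITH ToMask AS
(
    -- number the rows to be masked, in arbitrary order
    SELECT Name, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNo
    FROM tbl
),
Masks AS
(
    -- renumber the generated values 1..@Rows
    SELECT RandomString, ROW_NUMBER() OVER (ORDER BY Number) AS RowNo
    FROM dbo.MaskGenerator(NULL, NULL, 5, 5, @Rows, 2)
)
UPDATE t
SET Name = m.RandomString
FROM ToMask AS t
JOIN Masks AS m
    ON m.RowNo = t.RowNo;
```

Changing the arguments to dbo.MaskGenerator is enough to switch to any of the other masks from the test cases.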