How can I count sequential groups of data in the same column in SQL Server 2014?
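I need to find the frequency of sequences of rows. I have about 17,000 rows of data containing nearly 120 distinct values, and I need to find which sequences of values repeat and how many times.
For example: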
a
b
c
a
b
d
a
b
c
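I am trying to find repeated orderings, which some people call frequent sequences. So how many times do aa, abc, ab, bc, abca, and so on occur in this column? In other words, I need to find how many times the data contains the same group of consecutive rows.
This example has 4 distinct values, so there are many combinations. By my calculation there are C(4,1)*4!+C(4,2)*2!+C(4,3)*3!+C(4,4) different orderings, and I need to count how many times each one occurs.
A short part of my real column data (each consecutive value is one row):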
3E010000
2010000
2010007
2010008
2010000
2010003
2010009
0201000A
0B01000C
2010002
3E010000
2010000
2010007
0B010014
2010009
0201000A
0B01000C
2010002
Now, if you check the whole main column, you will find this group of data:
3E010000
2010000
2010007
and this:
3E010000
2010000
and this:
2010009
0201000A
0B01000C
2010002
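and so on. You can see that they are repeated more than once.
These rows repeat in the main first column, and I am trying to find repeated sequences of 1, 2, 3, 4, and at most 5 rows among the roughly 120 distinct values.
I am using Microsoft SQL Server 2014, but if this is not possible in Microsoft SQL Server, please suggest any other tool. Could you help me? Thank you very much!
Output: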
0B010009 ,0B010009,0B010009,2010005,2010005,2010005 2 9
0B010014 ,0B010014,0B010014,16010002,16010002,16010002 2 3
2010002,2010002,0201FFE0,0201FFE0 2 13
0B0114B5 ,0B0114B5,0B0114B5,2010002,2010002,2010002,2010004,2010004,2010004 3 3
070105B3 ,070105B3,070105B3,2010005,2010005,2010005,0201FFE1 ,0201FFE1,0201FFE1 3 2
3E010000 ,3E010000,3E010000,0B010010,0B010010,0B010010 ,0B01F61D ,0B01F61D,0B01F61D 3 6
3E010002 ,3E010002,3E010002,0B010013,0B010013,0B010013 ,0B01F80D ,0B01F80D,0B01F80D 3 3
0B010003 ,0B010003,2010006,2010006,0B01000A ,0B01000A,2010005,2010005 4 2
0B01FFE1 ,0B01FFE1,0B01FFE1,0B010013,0B010013,0B010013 ,0B01EAD0 ,0B01EAD0,0B01EAD0,0B010004,0B010004,0B010004 4 4
0B01000C ,0B01000C,0B01000C,0B01FCBD,0B01FCBD,0B01FCBD ,0701FFE0 ,0701FFE0,0701FFE0,0B01000A,0B01000A,0B01000A 4 5
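The query below finds repeating patterns of 2, 3, 4, and 5 consecutive rows.
It uses the LEAD and HASHBYTES functions.
The query works by computing, for every row, a hash of the values in the current row plus the following rows, and then grouping on those hashes to find "duplicate" patterns.
Note: a continuously increasing sequence column that represents row position, here ID, is assumed.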
CREATE TABLE #Data( ID INT IDENTITY PRIMARY KEY, Val VARCHAR( 20 ))
INSERT INTO #Data
VALUES
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '2010008' ), ( '2010000' ),
( '2010003' ), ( '2010009' ), ( '0201000A' ), ( '0B01000C' ), ( '2010002' ),
( '3E010000' ), ( '2010000' ), ( '2010007' ), ( '0B010014' ), ( '2010009' ),
( '0201000A' ), ( '0B01000C' ), ( '2010002' )
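-- Count 3-row patterns that occur more than once; group by Pat2Rows, Pat4Rows, or Pat5Rows for other sizes.
-- Each PatNRows column hashes the current value plus the next N-1 values, ordered by ID.
-- The ',' separator keeps different sequences from concatenating to the same string.
SELECT Pat3Rows, COUNT(*) AS Cnt
FROM(
SELECT *,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
FROM #Data AS D1
) AS HashedGroups
GROUP BY Pat3Rows
HAVING COUNT(*) > 1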
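Note: a hash collision is possible, although extremely unlikely, so the logic above is not guaranteed to handle every theoretically possible case. In short, I would not recommend it for a program on which someone's life depends and that must be 100% accurate at all times.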
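To put a rough number on "extremely unlikely": if MD5 output is treated as uniformly random 128-bit values (an assumption that is reasonable for accidental collisions, not for deliberately crafted input), the birthday bound limits the chance that any two distinct patterns among n hashed windows collide. The sketch below is only an estimate in Python; the function name is made up for illustration and the 17,000 figure is the row count from the question.

```python
def collision_upper_bound(n_items: int, hash_bits: int = 128) -> float:
    """Upper bound on the probability that any two of n_items distinct
    inputs share a hash, assuming uniformly random hash_bits-bit output."""
    pairs = n_items * (n_items - 1) // 2   # number of distinct pairs of inputs
    return pairs / 2 ** hash_bits          # union bound over all pairs

# Roughly one hashed window per row of the 17,000-row column.
p = collision_upper_bound(17_000)
print(f"{p:.2e}")
```

Under that assumption the bound is far below one in a billion billion for a column of this size, so the caveat is about strict guarantees rather than practical risk.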
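You did not specify what the output should look like, so I will leave that to you.
I also tested this on 18,000 rows on my laptop, and it produced results in under 1 second.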
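If hash collisions are a concern, or if a tool outside SQL Server is acceptable (the question leaves that open), the same sliding-window idea can be sketched in Python by counting the exact tuples of consecutive values instead of their hashes. This is a minimal sketch, not a port of the query: the function name repeated_patterns and its parameters are hypothetical, and the input is the 18 sample values from the question.

```python
from collections import Counter

def repeated_patterns(values, min_size=2, max_size=5):
    """Return {pattern_tuple: count} for every run of consecutive values
    (length min_size..max_size) that occurs more than once."""
    repeated = {}
    for size in range(min_size, max_size + 1):
        # One window per starting row; windows that would run past the end are skipped.
        windows = (tuple(values[i:i + size]) for i in range(len(values) - size + 1))
        counts = Counter(windows)
        repeated.update({pat: n for pat, n in counts.items() if n > 1})
    return repeated

values = [
    "3E010000", "2010000", "2010007", "2010008", "2010000",
    "2010003", "2010009", "0201000A", "0B01000C", "2010002",
    "3E010000", "2010000", "2010007", "0B010014", "2010009",
    "0201000A", "0B01000C", "2010002",
]

for pattern, count in repeated_patterns(values).items():
    print(",".join(pattern), len(pattern), count)
```

Like the SQL approach, this counts overlapping occurrences; unlike it, the comparison is on the values themselves, so no collision is possible.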
Sample use case:
;WITH DataHashed AS(
SELECT *,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID )) AS Pat2Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID )) AS Pat3Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 3, '' ) OVER( ORDER BY ID )) AS Pat4Rows,
HASHBYTES( 'MD5', Val + ',' + LEAD( Val, 1, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 2, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 3, '' ) OVER( ORDER BY ID ) + ',' + LEAD( Val, 4, '' ) OVER( ORDER BY ID )) AS Pat5Rows
FROM #Data ),
RepeatingPatterns AS(
SELECT MIN( ID ) AS FirstRow, Pat2Rows AS PatternHash, 2 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat2Rows HAVING COUNT(*) > 1
UNION ALL
SELECT MIN( ID ) AS FirstRow, Pat3Rows, 3 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat3Rows HAVING COUNT(*) > 1
UNION ALL
SELECT MIN( ID ) AS FirstRow, Pat4Rows, 4 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat4Rows HAVING COUNT(*) > 1
UNION ALL
SELECT MIN( ID ) AS FirstRow, Pat5Rows, 5 AS PatternSize, COUNT( * ) AS Cnt FROM DataHashed GROUP BY Pat5Rows HAVING COUNT(*) > 1
)
--SELECT * FROM RepeatingPatterns
SELECT
CONVERT( VARCHAR( 50 ), SUBSTRING(
( SELECT ',' + D.Val AS [text()]
FROM #Data AS D
WHERE RP.FirstRow <= D.ID AND D.ID < ( RP.FirstRow + RP.PatternSize )
ORDER BY D.ID
FOR XML PATH ('')
), 2, 1000 )) AS Pattern, CONVERT( VARCHAR( 35 ), PatternHash, 1 ) AS PatternHash, RP.PatternSize, Cnt
FROM RepeatingPatterns AS RP
Sample output:
Pattern PatternHash PatternSize Cnt
-------------------------------------------------- ----------------------------------- ----------- -----------
0201000A,0B01000C 0x499D8B1750A9BF57795B4D60D58DCF81 2 2
2010000,2010007 0x7EDE1E675D934F3035DACAC53F74DD14 2 2
3E010000,2010000 0x85FBFD817CFBB9BD08E983671EB594B7 2 2
2010009,0201000A 0x8E18E36B989BD859AF039238711A7F8C 2 2
0B01000C,2010002 0xF1EABB115FB3AEF2D162FB3EC7B6AFDA 2 2
0201000A,0B01000C,2010002 0x6DE203B38A13501881610133C1EDBF85 3 2
2010009,0201000A,0B01000C 0x9EB3ACFE8580A39FC530C7CA54830602 3 2
3E010000,2010000,2010007 0xE414661F54C985B7ED9FA82FF05C1219 3 2
2010009,0201000A,0B01000C,2010002 0x7FCDB748E37A6F6299AE8B269A4B0E49 4 2
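The ',' separator placed between the concatenated values in the query above matters because the values in this column do not all have the same length: without a separator, two different sequences can concatenate to the same string and therefore hash identically. The Python sketch below illustrates this with two hypothetical value pairs (they are not taken from the real data), and md5_of is a helper name invented for the example.

```python
import hashlib

def md5_of(parts, sep=""):
    """MD5 hex digest of the parts joined with sep."""
    return hashlib.md5(sep.join(parts).encode("utf-8")).hexdigest()

# Two different (hypothetical) two-row sequences ...
seq_a = ("201000", "02010")
seq_b = ("2010000", "2010")

# ... that become the same string when concatenated without a separator.
print(md5_of(seq_a) == md5_of(seq_b))            # same input string, same hash
print(md5_of(seq_a, ",") == md5_of(seq_b, ","))  # separator keeps them apart
```

Any separator works as long as it cannot occur inside a value; a comma is safe for the hexadecimal-looking codes shown here.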