SQL: Split data by continuous increasing sequence, then subset each chunk by a pattern
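I have data in which I am trying to identify patterns. However, the data in each table is incomplete (rows are missing). I would like to split the table into chunks of complete data and then identify the pattern within each chunk. I have a column, called sequence, that I can use to determine whether the data is complete.
The data looks like this: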
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
First, I would like to split the data into complete sections:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
---------------------------
8 closed
9 open
---------------------------
11 open
---------------------------
13 closed
14 open
15 open
---------------------------
18 closed
19 open
20 closed
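The split shown above is commonly done by subtracting a row number from the sequence value (the "gaps and islands" technique). The sketch below is not from the original post: it assumes SQLite 3.25 or later (for window functions) driven through Python's built-in sqlite3 module, a table named t, and a hypothetical helper name split_into_islands.

```python
import sqlite3

# Sample rows from the question. The table name `t` is an assumption.
ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed")]

ISLANDS_SQL = """
SELECT sequence,
       position,
       -- Inside an unbroken run, sequence and ROW_NUMBER() rise together,
       -- so their difference stays constant; it jumps at every gap.
       sequence - ROW_NUMBER() OVER (ORDER BY sequence) AS island
FROM t
ORDER BY sequence
"""


def split_into_islands(rows):
    """Return the unbroken chunks, each as a list of (sequence, position)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (sequence INTEGER PRIMARY KEY, position TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    islands = {}
    for seq, pos, island in conn.execute(ISLANDS_SQL):
        islands.setdefault(island, []).append((seq, pos))
    conn.close()
    return list(islands.values())


if __name__ == "__main__":
    for chunk in split_into_islands(ROWS):
        print(chunk)
```

Each printed list corresponds to one block between the dashed separators above. The island value itself has no meaning beyond being equal for rows in the same chunk.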
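Then I would like to identify the pattern closed, open, ..., open, closed, that is, we go from closed to open for n rows (where n is at least 1) and then back to closed.
With the sample data, this would leave: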
Sequence Position
2 closed
3 open
4 open
5 closed
---------------------------
18 closed
19 open
20 closed
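This leaves me with a final table I can analyse, since I know it contains no broken sequences. I also have another column in which position is binary, if that is easier to work with.
The table is large, so although I think I could write loops to work out the result, I doubt that approach would be efficient enough. Alternatively, I could pull the entire table into R and compute the result table there, but that requires pulling everything into R first, so I am wondering whether this is feasible in SQL.
Edit: different, more representative sample data: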
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
21 closed
22 closed
23 closed
24 open
25 open
26 closed
27 open
Note that this should give the same result as before, plus:
23 closed
24 open
25 open
26 closed
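21, 22 and 27 are not included because they do not fit the closed, open, ..., open, closed pattern.
However, if we also had 28 closed, we would want 27 and 28, because there is no gap and the pattern fits. If instead of 28 we had 29 closed, we would not want 27 or 29 (because although the pattern is right, the sequence is broken).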
To add some context, think of a machine that goes from stopped, to running, to stopped. We record the data, but there are gaps in the recording, represented here by breaks in the sequence. Besides data missing in the middle of a stop, running, stop cycle, the recording sometimes starts when the machine is already running, or ends before the machine stops. I don't want that data, as it is not a complete cycle of stop, running, stop. I only want complete cycles where the sequence was continuous.
This means I can transform my original data set into one with only complete cycles, one after the other.
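The rule described above, including the 28 versus 29 edge case, can be written down as a small reference implementation that is handy for checking any SQL answer on a sample. This is only a sketch in plain Python, not the set-based solution being asked for; the function name complete_cycles is hypothetical.

```python
def complete_cycles(rows):
    """Return every closed, open x n (n >= 1), closed run with no sequence gap.

    rows: iterable of (sequence, position) pairs sorted by sequence.
    """
    cycles = []
    current = []      # a 'closed' row plus the unbroken 'open' rows after it
    prev_seq = None
    for seq, pos in rows:
        contiguous = prev_seq is not None and seq == prev_seq + 1
        if pos == "closed":
            # A cycle completes only if closed + at least one open precede
            # this row without any break in the sequence.
            if contiguous and len(current) >= 2:
                cycles.append(current + [(seq, pos)])
            current = [(seq, pos)]   # this row may also start the next cycle
        elif contiguous and current:
            current.append((seq, pos))
        else:
            current = []             # an open row with no usable start behind it
        prev_seq = seq
    return cycles


if __name__ == "__main__":
    tail = [(23, "closed"), (24, "open"), (25, "open"), (26, "closed"), (27, "open")]
    print(complete_cycles(tail))                     # one cycle: 23..26
    print(complete_cycles(tail + [(28, "closed")]))  # 23..26 and 26..28
    print(complete_cycles(tail + [(29, "closed")]))  # 23..26 only, 28 is missing
```

Note that a closed row can end one cycle and start the next, which is why 26 appears in both cycles when 28 closed is present.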
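I think there is actually a relatively simple way to look at this. You can identify the ending sequence number of each cycle by:
- Looking at the sequence of the previous closed row
- Looking at the cumulative number of opens at the previous closed row and at the current one
- Doing the arithmetic to be sure that all the intermediate rows are in the data
This turns into a query: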
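-- Window results cannot be filtered in the SELECT that computes them, so wrap them in a subquery.
select s.*
from (select t.*,
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   -- running count of 'open' rows up to and including this row
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) s
where position = 'closed' and
      -- every sequence number between the two closed rows is present (and is an 'open')
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
      -- at least one 'open' lies between them
      sequence > prev_sequence + 1;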
Now that you have identified the sequences, you can join back to get the original rows:
select t.*
from t join
     (select s.*
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) s
      where position = 'closed' and
            (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
            sequence > prev_sequence + 1
     ) seqs
     on t.sequence between seqs.prev_sequence and seqs.sequence;
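I admit that I haven't tested this. However, I do think the idea works. One caveat is that it will select multiple "closed" periods for each sequence group.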
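For anyone who wants to try the idea on the sample data, here is a runnable variant. It is a sketch under several assumptions that are not part of the answer: SQLite 3.25 or later through Python's sqlite3 module, the literal 'closed' as used in the sample data, the window results filtered in a separate step, a `+ 1` check so that at least one open lies between two closed rows, and DISTINCT so that a closed row shared by two back-to-back cycles is returned once. The name find_cycle_rows is hypothetical.

```python
import sqlite3

# Extended sample data from the question's edit.
ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
        (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
        (25, "open"), (26, "closed"), (27, "open")]

CYCLES_SQL = """
WITH closes AS (
    SELECT sequence, position, cume_opens,
           LAG(sequence)   OVER (PARTITION BY position ORDER BY sequence) AS prev_sequence,
           LAG(cume_opens) OVER (PARTITION BY position ORDER BY sequence) AS prev_cume_opens
    FROM (SELECT t.*,
                 SUM(CASE WHEN position = 'open' THEN 1 ELSE 0 END)
                     OVER (ORDER BY sequence) AS cume_opens
          FROM t)
),
cycles AS (
    SELECT prev_sequence AS first_seq, sequence AS last_seq
    FROM closes
    WHERE position = 'closed'
      AND sequence > prev_sequence + 1          -- at least one row in between
      AND cume_opens - prev_cume_opens = sequence - prev_sequence - 1  -- none missing
)
SELECT DISTINCT t.sequence, t.position
FROM t
JOIN cycles c ON t.sequence BETWEEN c.first_seq AND c.last_seq
ORDER BY t.sequence
"""


def find_cycle_rows(rows):
    """Load rows into an in-memory table `t` and return the rows inside complete cycles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (sequence INTEGER PRIMARY KEY, position TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(CYCLES_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in find_cycle_rows(ROWS):
        print(row)
```

On the extended sample this yields the rows 2 to 5, 18 to 20 and 23 to 26. Dropping DISTINCT shows the duplication the answer mentions as soon as two cycles share a closed row, for example after appending (28, 'closed').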
You can use this.
DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))
INSERT INTO @MyTable
VALUES
(1,'open'),
(2,'closed') ,
(3,'open'),
(4,'open'),
(5,'closed'),
(8,'closed'),
(9,'open'),
(11,'open'),
(13,'closed'),
(14,'open') ,
(15,'open'),
(18,'closed'),
(19,'open'),
(20,'closed'),
(21,'closed'),
(22,'closed'),
(23,'closed'),
(24,'open'),
(25,'open'),
(26,'closed'),
(27,'open')
;WITH CTE AS(
SELECT * ,
CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark
FROM @MyTable
)
,CTE_2 AS
(
SELECT
[New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING ))
, [Sequence]
, Position
FROM CTE
)
,CTE_3 AS (
SELECT *,
RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence])
FROM CTE_2
)
,CTE_4 AS
(
SELECT ([New_Sequence] - RN) G
, MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq
, MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq
FROM CTE_3
GROUP BY ([New_Sequence] - RN)
)
SELECT
CTE.Sequence, CTE.Position
FROM CTE_4
INNER JOIN CTE ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE
CTE_4.MaxCloseSq > CTE_4.MinCloseSq
AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)
Result:
Sequence Position
----------- ----------
2 closed
3 open
4 open
5 closed
--- ---
18 closed
19 open
20 closed
--- ---
23 closed
24 open
25 open
26 closed
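The query above comes without an explanation, so it may help to look at its intermediate grouping key. The sketch below ports the same steps to SQLite (3.25 or later) via Python's sqlite3 module; the port, the table name t, the lower-case column names and the helper names are assumptions, not part of the answer. Each closed row that directly follows another closed row bumps a running counter, the counter is added to the sequence, and subtracting a row number then gives a value g that changes at every gap and between two adjacent closed rows.

```python
import sqlite3

ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
        (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
        (25, "open"), (26, "closed"), (27, "open")]

CTES = """
WITH marked AS (
    SELECT sequence, position,
           -- 1 when this 'closed' row comes right after another 'closed' row
           CASE WHEN position = 'closed'
                 AND LAG(position) OVER (ORDER BY sequence) = 'closed'
                THEN 1 ELSE 0 END AS close_mark
    FROM t
),
shifted AS (
    SELECT sequence, position,
           sequence + SUM(close_mark) OVER (
               ORDER BY sequence
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS new_sequence
    FROM marked
),
grouped AS (
    SELECT sequence, position,
           new_sequence - ROW_NUMBER() OVER (ORDER BY new_sequence) AS g
    FROM shifted
)
"""
GROUPS_SQL = CTES + "SELECT sequence, g FROM grouped ORDER BY sequence"
CYCLES_SQL = CTES + """,
bounds AS (
    SELECT g,
           MIN(CASE WHEN position = 'closed' THEN sequence END) AS min_close,
           MAX(CASE WHEN position = 'closed' THEN sequence END) AS max_close
    FROM grouped
    GROUP BY g
)
SELECT t.sequence, t.position
FROM bounds b
JOIN t ON t.sequence BETWEEN b.min_close AND b.max_close
WHERE b.max_close > b.min_close      -- needs two different closed rows
ORDER BY t.sequence
"""


def run(sql, rows):
    """Execute sql against an in-memory table `t` filled with rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (sequence INTEGER PRIMARY KEY, position TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run(GROUPS_SQL, ROWS))   # (sequence, g) pairs
    print(run(CYCLES_SQL, ROWS))   # rows between first and last closed per group
```

Rows 21 and 22 each end up in a group of their own, which is why they drop out, while 23 to 27 share one group and are trimmed to 23 to 26 by the first and last closed row.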