SQL Split data by continuous increasing sequence & then subset each by a pattern

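I have data in which I am trying to identify a pattern. However, the data in each table is incomplete (rows are missing). I want to split the table into chunks of complete data and then identify the pattern within each chunk. I have a column, called sequence, that I can use to determine whether the data is complete.

The data looks like this: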

Sequence      Position 
1              open
2              closed 
3              open
4              open
5              closed
8              closed
9              open
11             open
13             closed
14             open 
15             open
18             closed
19             open
20             closed

First, I want to split the data into complete sections:

   Sequence      Position 
    1              open
    2              closed 
    3              open
    4              open
    5              closed
---------------------------
    8              closed
    9              open
---------------------------
    11             open
---------------------------
    13             closed
    14             open 
    15             open
---------------------------
    18             closed
    19             open
    20             closed
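
One common way to get this split is the "gaps and islands" trick: within an unbroken run, `sequence` minus a row number is constant. Below is a minimal sketch of that idea, assuming SQLite 3.25 or later (for window functions) driven from Python's `sqlite3` module. The table name `t` and the helper `find_islands` are illustrative assumptions, not part of the original data set.

```python
import sqlite3

# First sample from the question; the table name `t` is an assumption.
ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed")]

# sequence - row_number() stays constant while the sequence increases by 1,
# and changes whenever a row is missing, so it labels each complete section.
ISLANDS_SQL = """
select min(sequence) as first_seq, max(sequence) as last_seq, count(*) as n_rows
from (select sequence,
             sequence - row_number() over (order by sequence) as grp
      from t) as numbered
group by grp
order by first_seq
"""

def find_islands(rows):
    """Load rows into an in-memory table and return one tuple per section."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (sequence integer, position text)")
    conn.executemany("insert into t values (?, ?)", rows)
    result = conn.execute(ISLANDS_SQL).fetchall()
    conn.close()
    return result

islands = find_islands(ROWS)
```

For the sample above, `islands` holds five sections: 1 to 5, 8 to 9, 11 alone, 13 to 15, and 18 to 20.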

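Then I want to identify the pattern closed, open, ..., open, closed, so that we go from closed to open for n rows (where n is at least 1) and then back to closed.

From the sample data, this would leave: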

     Sequence        Position 
        2              closed 
        3              open
        4              open
        5              closed
    ---------------------------
        18             closed
        19             open
        20             closed

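This leaves me with a final table that I can analyze, since I know it contains no broken sequences. I also have another column in which position is binary, if that is easier to work with.

The table is large, so although I think I could write loops to compute the result, I suspect that approach would not be efficient enough. Alternatively, I was going to pull the whole table into R and build the result table there, but that requires pulling everything into R first, so I am wondering whether this is feasible in SQL.

Edit: different, more representative sample data: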

Sequence      Position 
    1              open
    2              closed 
    3              open
    4              open
    5              closed
    8              closed
    9              open
    11             open
    13             closed
    14             open 
    15             open
    18             closed
    19             open
    20             closed
    21             closed
    22             closed
    23             closed
    24             open
    25             open
    26             closed
    27             open

Note that this should give the same results as before, but should also include

    23             closed
    24             open
    25             open
    26             closed

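but not 21, 22 and 27, because they don't fit the closed, open, ..., open, closed pattern.

However, if we had 28 closed, we would want 27 and 28, because there is no gap in time and the pattern fits. If, instead of 28, we had 29 closed, we would not want 27 and 29 (because although the pattern is right, the sequence is broken).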
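
The rules above can be pinned down with a small reference implementation, which is handy for checking a SQL result against a small sample. This is only a sketch of the rules as stated in the question; the function name `complete_cycles` is made up for illustration, and a plain loop like this is not meant for the full table.

```python
# Second (edited) sample from the question.
ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
        (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
        (25, "open"), (26, "closed"), (27, "open")]

def complete_cycles(rows):
    """Return (start, end) sequence pairs for closed, open, ..., open, closed
    runs that contain at least one open row and no missing sequence numbers."""
    cycles = []
    start = None      # most recent 'closed' row that could begin a cycle
    opens = 0         # 'open' rows seen since `start`
    prev_seq = None
    for seq, pos in sorted(rows):
        if prev_seq is not None and seq != prev_seq + 1:
            start, opens = None, 0          # gap: discard any partial cycle
        if pos == "closed":
            if start is not None and opens >= 1:
                cycles.append((start, seq))
            start, opens = seq, 0           # this row may begin the next cycle
        elif start is not None:
            opens += 1
        prev_seq = seq
    return cycles

cycles = complete_cycles(ROWS)
with_28 = complete_cycles(ROWS + [(28, "closed")])
with_29 = complete_cycles(ROWS + [(29, "closed")])
```

Here `cycles` is `[(2, 5), (18, 20), (23, 26)]`. Appending 28 closed adds the cycle `(26, 28)`, while appending 29 closed adds nothing because the sequence is broken between 27 and 29.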

To add some context, think of a machine that goes from stopped, to running, to stopped. We record the data, but there are gaps in the recording, which are represented here by breaks in the sequence. Besides missing data in the middle of a stop, running, stop cycle, the data sometimes starts recording when the machine is already running, or stops recording before the machine stops. I don't want that data, as it is not a complete cycle of stop, running, stop. I only want the complete cycles where the sequence was continuous. This means I can transform my original data set into one containing only complete cycles, one after the other.

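I think there is actually a relatively simple way to look at this. You can identify the ending sequence numbers by:

  • Looking at the sequence of the previous close
  • Looking at the cumulative number of opens at the previous close and at the current close
  • Doing arithmetic to ensure that all the intermediate rows are in the data

This turns into a query: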

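-- The lag() values are computed in a subquery so that the outer WHERE can
-- filter on them; a window-function alias cannot be referenced in the WHERE
-- clause of the same query level.
select s.*
from (select t.*,
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) s
where position = 'closed' and
      -- every row between the two closes is an open, so nothing is missing
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
      -- at least one open row lies between the two closes
      sequence > prev_sequence + 1;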
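
To make the arithmetic concrete, here is a minimal sketch that runs this idea against the edited sample data. It assumes SQLite 3.25 or later (for window functions) driven from Python's `sqlite3` module, with the sample loaded into a table named `t`. The version below applies the filter in an outer query, uses the literal 'closed' from the sample data, and requires at least one open row between the two closes; the helper name `closing_rows` is illustrative.

```python
import sqlite3

# Edited sample from the question; the table name `t` is an assumption.
ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
        (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
        (25, "open"), (26, "closed"), (27, "open")]

CLOSING_ROWS_SQL = """
select sequence, prev_sequence
from (select c.*,
             lag(sequence)   over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end)
                       over (order by sequence) as cume_opens
            from t) as c) as s
where position = 'closed'
  and cume_opens - prev_cume_opens = sequence - prev_sequence - 1  -- no missing rows
  and sequence > prev_sequence + 1                                 -- at least one open
order by sequence
"""

def closing_rows(rows):
    """Return (closing sequence, previous closing sequence) for each cycle."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (sequence integer, position text)")
    conn.executemany("insert into t values (?, ?)", rows)
    result = conn.execute(CLOSING_ROWS_SQL).fetchall()
    conn.close()
    return result

closers = closing_rows(ROWS)
```

For the sample, `closers` is `[(5, 2), (20, 18), (26, 23)]`. For example, at sequence 26 the cumulative open count is 10 and at the previous close (23) it is 8, and 10 - 8 equals 26 - 23 - 1, so rows 24 and 25 are both present and both open.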

Now that you have identified the sequences, you can join back to get the original rows:

select t.*
from t join
     (select s.*
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) s
      -- filter on the lag() values one level above where they are computed
      where position = 'closed' and
            (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
            sequence > prev_sequence + 1
     ) seqs
     on t.sequence between seqs.prev_sequence and seqs.sequence;

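I admit that I have not tested this. However, I do think the idea works. One caveat is that it will pick up multiple "closed" cycles within each sequence group.

You can use this.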
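
As a rough check of the join-back step, the sketch below runs it on the edited sample with a hypothetical extra row, 28 closed, so that two cycles share the closed row at 26. It assumes SQLite 3.25 or later driven from Python's `sqlite3` module and a table named `t`. The `cycle_end` label and the helper `cycle_rows` are additions for illustration, and the filter again sits in an outer query with the literal 'closed'.

```python
import sqlite3

# Edited sample from the question plus a hypothetical row (28, 'closed').
ROWS = [(1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
        (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
        (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
        (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
        (25, "open"), (26, "closed"), (27, "open"), (28, "closed")]

CYCLE_ROWS_SQL = """
with seqs as (
    select sequence, prev_sequence
    from (select c.*,
                 lag(sequence)   over (partition by position order by sequence) as prev_sequence,
                 lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
          from (select t.*,
                       sum(case when position = 'open' then 1 else 0 end)
                           over (order by sequence) as cume_opens
                from t) as c) as s
    where position = 'closed'
      and cume_opens - prev_cume_opens = sequence - prev_sequence - 1
      and sequence > prev_sequence + 1
)
-- Label each row with the closing sequence of the cycle it belongs to.
select seqs.sequence as cycle_end, t.sequence, t.position
from t
join seqs on t.sequence between seqs.prev_sequence and seqs.sequence
order by cycle_end, t.sequence
"""

def cycle_rows(rows):
    """Return (cycle_end, sequence, position) for every row in a complete cycle."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (sequence integer, position text)")
    conn.executemany("insert into t values (?, ?)", rows)
    result = conn.execute(CYCLE_ROWS_SQL).fetchall()
    conn.close()
    return result

labelled = cycle_rows(ROWS)
shared = [row for row in labelled if row[1] == 26]
```

With back-to-back cycles, the shared closed row is returned once per cycle: here `shared` is `[(26, 26, 'closed'), (28, 26, 'closed')]`. Whether to keep that duplicate (one copy per cycle) or collapse it with `select distinct` on the original columns depends on the analysis.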

DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))

INSERT INTO @MyTable
VALUES
(1,'open'),
(2,'closed') ,
(3,'open'),
(4,'open'),
(5,'closed'),
(8,'closed'),
(9,'open'),
(11,'open'),
(13,'closed'),
(14,'open') ,
(15,'open'),
(18,'closed'),
(19,'open'),
(20,'closed'),
(21,'closed'),
(22,'closed'),
(23,'closed'),
(24,'open'),
(25,'open'),
(26,'closed'),
(27,'open')


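;WITH CTE AS(
    -- Flag every 'closed' row whose previous row is also 'closed'
    SELECT * ,
        CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark
    FROM @MyTable
)
,CTE_2 AS 
(
    -- Shift the sequence by the running count of flags, so that two
    -- consecutive 'closed' rows never fall into the same island
    SELECT 
        [New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING )) 
        , [Sequence]
        , Position
     FROM CTE
)
,CTE_3 AS (
    -- Number the rows; New_Sequence - RN is constant within an island
    SELECT *, 
    RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence]) 
    FROM CTE_2
)
,CTE_4 AS
(
    -- For each island, find the first and last 'closed' row
    SELECT ([New_Sequence] - RN) G
    , MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq
    , MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq
    FROM CTE_3 
    GROUP BY ([New_Sequence] - RN)
)
-- Keep the rows between the first and last 'closed' row of each island,
-- skipping islands that do not contain two distinct 'closed' rows
SELECT
    CTE.Sequence, CTE.Position
FROM CTE_4 
    INNER JOIN CTE  ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE
    CTE_4.MaxCloseSq > CTE_4.MinCloseSq
    AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)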

Result:

Sequence    Position
----------- ----------
2           closed
3           open
4           open
5           closed
---         ---
18          closed
19          open
20          closed
---         ---
23          closed
24          open
25          open
26          closed