SQL lead function based on two columns
I have a table with about 700 million rows; the sample below shows a single line_id.
LINE_ID|COLLECTION_DATE |DSL_CARD_TYPE|
-------|-------------------|-------------|
1234567|2020-03-25 08:46:08|ADSL_PORT |
1234567|2020-03-26 08:31:48|ADSL_PORT |
1234567|2020-03-27 08:42:40|VDSL_PORT |
1234567|2020-03-28 08:36:32|VDSL_PORT |
1234567|2020-03-29 08:31:33|VDSL_PORT |
1234567|2020-03-30 08:50:15|VDSL_PORT |
1234567|2020-03-31 08:44:33|ADSL_PORT |
1234567|2020-04-01 08:34:53|ADSL_PORT |
1234567|2020-04-02 08:44:11|ADSL_PORT |
1234567|2020-04-03 08:43:51|VDSL_PORT |
1234567|2020-04-04 08:54:33|ADSL_PORT |
1234567|2020-04-05 09:06:47|ADSL_PORT |
1234567|2020-04-06 09:06:57|VDSL_PORT |
1234567|2020-04-07 09:13:32|VDSL_PORT |
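I need to group consecutive rows that share the same DSL_CARD_TYPE and add a new column, Next_COLLECTION_DATE, so that each output row covers one run of the same card type, like this: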
LINE_ID|COLLECTION_DATE |Next_COLLECTION_DATE |DSL_CARD_TYPE|
-------|-------------------|----------------------|-------------|
1234567|2020-03-25 08:46:08|2020-03-26 08:31:48 |ADSL_PORT |
1234567|2020-03-27 08:42:40|2020-03-30 08:50:15 |VDSL_PORT |
1234567|2020-03-31 08:34:53|2020-04-02 08:44:11 |ADSL_PORT |
1234567|2020-04-03 08:43:51|2020-04-03 08:43:51 |VDSL_PORT |
1234567|2020-04-04 08:54:33|2020-04-05 09:06:47 |ADSL_PORT |
1234567|2020-04-06 09:06:57|2020-04-07 09:13:32 |VDSL_PORT |
I wrote a clumsy, overly complex query that does the job, but on this much data it takes hours:
SELECT LINE_ID, COLLECTION_DATE,
       COALESCE(LEAD(COLLECTION_DATE) OVER (PARTITION BY LINE_ID ORDER BY COLLECTION_DATE), NOW()) AS Next_Collection_Date,
       DSL_CARD_TYPE
FROM (
    SELECT * FROM (
        SELECT
            LINE_ID, COLLECTION_DATE,
            DSL_CARD_TYPE,
            LEAD(DSL_CARD_TYPE) OVER (PARTITION BY LINE_ID ORDER BY COLLECTION_DATE) AS To_Sync_Port,
            LAG(DSL_CARD_TYPE) OVER (PARTITION BY LINE_ID ORDER BY COLLECTION_DATE) AS B_Sync_Port
        FROM
            ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL
    ) abc
    -- SYNC_PORT is not among the columns selected above; it presumably stands for DSL_CARD_TYPE
    WHERE SYNC_PORT <> TO_SYNC_PORT OR B_Sync_Port IS NULL
) abc2
This looks like a gaps-and-islands problem, which in this case is best solved with a difference of row numbers:
select line_id, dsl_card_type,
       min(collection_date) as collection_date,
       max(collection_date) as next_collection_date
from (select v.*,
             row_number() over (partition by line_id order by collection_date) as seqnum,
             row_number() over (partition by line_id, dsl_card_type order by collection_date) as seqnum_2
      from ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL v
      -- optional date filter; adjust or drop it to cover the rows of interest
      where collection_date >= '2020-07-27 00:00:00'
     ) v
group by line_id, dsl_card_type, (seqnum - seqnum_2);
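Explaining how this works is a bit tricky. If you run the subquery on its own, you can see how the difference between the two row numbers identifies adjacent rows with the same card type.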
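To make that concrete, here is a small sketch that runs the same row-number trick on the sample rows from the question. It is only an illustration under a few assumptions: it uses Python's built-in sqlite3 with an in-memory table I named ports (not the real view), it needs an SQLite build with window functions (3.25 or later), it reads the two out-of-sequence sample dates as 2020-03-31 and 2020-04-01, and it drops the date filter so the sample rows are not excluded. The output aliases island_start and island_end are also mine.

```python
import sqlite3

# Sample rows from the question. Assumption: the two out-of-sequence dates
# are read as 2020-03-31 and 2020-04-01.
rows = [
    (1234567, "2020-03-25 08:46:08", "ADSL_PORT"),
    (1234567, "2020-03-26 08:31:48", "ADSL_PORT"),
    (1234567, "2020-03-27 08:42:40", "VDSL_PORT"),
    (1234567, "2020-03-28 08:36:32", "VDSL_PORT"),
    (1234567, "2020-03-29 08:31:33", "VDSL_PORT"),
    (1234567, "2020-03-30 08:50:15", "VDSL_PORT"),
    (1234567, "2020-03-31 08:44:33", "ADSL_PORT"),
    (1234567, "2020-04-01 08:34:53", "ADSL_PORT"),
    (1234567, "2020-04-02 08:44:11", "ADSL_PORT"),
    (1234567, "2020-04-03 08:43:51", "VDSL_PORT"),
    (1234567, "2020-04-04 08:54:33", "ADSL_PORT"),
    (1234567, "2020-04-05 09:06:47", "ADSL_PORT"),
    (1234567, "2020-04-06 09:06:57", "VDSL_PORT"),
    (1234567, "2020-04-07 09:13:32", "VDSL_PORT"),
]

# Hypothetical in-memory stand-in for the real view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ports (line_id INTEGER, collection_date TEXT, dsl_card_type TEXT)"
)
conn.executemany("INSERT INTO ports VALUES (?, ?, ?)", rows)

# The inner subquery from the answer: one row number per line,
# one row number per (line, card type).
numbered_sql = """
SELECT p.*,
       ROW_NUMBER() OVER (PARTITION BY line_id ORDER BY collection_date) AS seqnum,
       ROW_NUMBER() OVER (PARTITION BY line_id, dsl_card_type
                          ORDER BY collection_date) AS seqnum_2
FROM ports p
"""

# Step 1: look at the difference; it stays constant within a run of one card type.
numbered = conn.execute(
    f"""
    SELECT collection_date, dsl_card_type, seqnum, seqnum_2,
           seqnum - seqnum_2 AS grp
    FROM ({numbered_sql}) AS v
    ORDER BY collection_date
    """
).fetchall()
for row in numbered:
    print(row)

# Step 2: group on (line, card type, difference) to collapse each run to one row.
islands = conn.execute(
    f"""
    SELECT line_id, dsl_card_type,
           MIN(collection_date) AS island_start,
           MAX(collection_date) AS island_end
    FROM ({numbered_sql}) AS v
    GROUP BY line_id, dsl_card_type, (seqnum - seqnum_2)
    ORDER BY island_start
    """
).fetchall()
for row in islands:
    print(row)
```

On this sample the grp column comes out as 0, 0, 2, 2, 2, 2, 4, 4, 4, 5, 5, 5, 7, 7, and the second query returns six rows, one per run. Note that grp 5 covers both the single VDSL_PORT row on 2020-04-03 and the two ADSL_PORT rows after it, which is why dsl_card_type has to stay in the GROUP BY alongside the difference.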