OLAP Function Processing - Why is it faster to run on N/M partitions M times than on N records 1 time?
I have a (very large) table like this:
CREATE SET TABLE LOAN
( LoanNumber VARCHAR(100),
LoanBalance DECIMAL(18,4),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber)
PARTITION BY RANGE_N
( RecTimeStamp BETWEEN
TIMESTAMP '2017-01-01 00:00:00+00:00'
AND TIMESTAMP '2017-12-31 23:59:59+00:00'
EACH INTERVAL '1' DAY
);
This table is usually summarized via snapshots. For example, the April month-end snapshot would be:
-- Pretend there is just 2017 data there
CREATE SET TABLE LOAN_APRIL AS
( SELECT *
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1
) WITH DATA
PRIMARY INDEX (LoanNumber);
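This usually takes a very long time to run. I experimented yesterday, though, and found that I got very good execution times by splitting it up like this: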
CREATE SET TABLE LOAN_APRIL_TMP
( LoanNumber VARCHAR(100),
LoanBalance DECIMAL(18,4),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber);
CREATE SET TABLE LOAN_APRIL
( LoanNumber VARCHAR(100),
LoanBalance DECIMAL(18,4),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber);
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-02-01' AND DATE '2017-02-28'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-03-01' AND DATE '2017-03-31'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-04-01' AND DATE '2017-04-30'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL
SELECT *
FROM LOAN_APRIL_TMP
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
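I just ran these sequentially, so they did not execute in parallel. Today I am going to experiment with how to get each segment to load in parallel.
Also, and more importantly, I have had a hard time finding enough technical documentation for these kinds of problems. Is there a good resource for this? I know much of it is proprietary, but there has to be something that describes, at least at a high level, how these functions are implemented.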
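There may be several reasons. You should check DBQL to see the actual difference in resource usage.
The data for the first Select is spread across more partitions than the data for the smaller Selects.
Explain might show that the spool will not be cached in memory for the large Select, but will be for the individual ones.
VarChars in PARTITION BY/ORDER BY are expanded to Chars of their defined size. If LoanNumber is actually a VarChar(100) (I doubt it is), this will increase the spool as well (but that is a common problem for other queries on this table, too).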
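The two checks suggested above could look roughly like the sketch below. It assumes that query logging (DBQL) is enabled for your user, that you can read the DBC.QryLogV view, and that the column names shown are available in your release. The LIKE filter on LOAN_APRIL is only an illustrative way to find the statements from the question. DBQL rows may also show up with some delay, because the log is written from a cache.

```sql
-- 1) Inspect the plan of the single large Select (EXPLAIN does not execute it).
--    Compare it with the plan of one monthly Select, paying attention to the
--    estimated spool sizes and to whether spools are expected to be cached.
EXPLAIN
SELECT *
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY ROW_NUMBER() OVER
        ( PARTITION BY LoanNumber
          ORDER BY RecTimeStamp DESC
        ) = 1;

-- 2) After running both variants, compare what they actually consumed.
SELECT  QueryID,
        StartTime,
        FirstRespTime,
        AMPCPUTime,                          -- CPU seconds summed over all AMPs
        MaxAMPCPUTime,                       -- CPU seconds on the busiest AMP
        TotalIOCount,                        -- logical I/O count
        SpoolUsage,                          -- spool used by the request
        SUBSTR(QueryText, 1, 80) AS QueryStart
FROM    DBC.QryLogV
WHERE   UserName = USER
  AND   CAST(StartTime AS DATE) = CURRENT_DATE
  AND   QueryText LIKE '%LOAN_APRIL%'        -- illustrative filter, adjust as needed
ORDER BY StartTime;
```

Summing AMPCPUTime, TotalIOCount and SpoolUsage over the five smaller statements and comparing the totals with the single large statement should show whether the difference is in CPU, I/O or spool.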
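OLAP functions have one drawback: they need two spools, i.e. the spool size is doubled. If this table has many columns/large rows, it might be more efficient to run ROW_NUMBER against only the PK of the table and then join back, like this: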
CREATE SET TABLE LOAN_APRIL_TMP
( LoanNumber VARCHAR(100),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber) -- same PPI as source table to facilitate fast join back
PARTITION BY RANGE_N
( RecTimeStamp BETWEEN
TIMESTAMP '2017-01-01 00:00:00+00:00'
AND TIMESTAMP '2017-12-31 23:59:59+00:00'
EACH INTERVAL '1' DAY
);
INSERT INTO LOAN_APRIL_TMP
SELECT LoanNumber, RecTimeStamp -- no other columns
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1
;
INSERT INTO LOAN_APRIL
SELECT l.* -- now get all columns
FROM LOAN AS l
JOIN LOAN_APRIL_TMP AS tmp
ON l.LoanNumber = tmp.LoanNumber
AND l.RecTimeStamp = tmp.RecTimeStamp;
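One thing that may be worth verifying with the join-back approach: ROW_NUMBER keeps exactly one row per LoanNumber, but the join on (LoanNumber, RecTimeStamp) returns every source row that shares that latest timestamp. Whether such ties can occur in LOAN is not stated in the question, so the following is only a sanity check under the assumption that they might:

```sql
-- Lists loans that ended up with more than one row in the snapshot.
-- An empty result means the join back produced one row per loan.
SELECT  LoanNumber,
        COUNT(*) AS RowsPerLoan
FROM    LOAN_APRIL
GROUP BY LoanNumber
HAVING  COUNT(*) > 1;
```

If this returns rows, adding a tie-breaking column to the ORDER BY of the ROW_NUMBER and to the join condition would be one way to make the result deterministic.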