Move billions of records from one table to another in Teradata

I have a Teradata view that holds 1 billion records per day, and I need to process 1 year of data, so roughly 365 billion records in total. The data is partitioned by date, at daily intervals.

I need to INSERT ... SELECT 3 ID columns (the data will be grouped by them) and 2 measure columns (which need to be aggregated with SUM).

The query looks like this:

Insert into table1
Select 
  col1, col2, col3, SUM(col4), SUM(col5)
FROM 
  table2
WHERE coldate between 'date1' and 'date2'
GROUP BY 
  col1, col2, col3;

The problem is that the query keeps running even for a single day (it does not finish within 20 minutes), and I need to run this for a whole year.

How should I approach this - should I use MLOAD, an INSERT ... SELECT, or something else?

Please advise; this needs to be resolved as soon as possible. Thanks.

Explain SELECT 
    ORIGINATING_NUMBER_VAL,
    SUM(ACTIVITY_DURATION_MEAS),
    SUM(Upload_Data_Volume),
    SUM(Download_Data_Volume)
FROM 
    dp_tab_view.NETWORK_ACTIVITY_DATA_RES
WHERE
    CAST(Activity_Start_Dttm as DATE) between '2014-12-01' AND '2014-12-31'
GROUP BY 
    ORIGINATING_NUMBER_VAL;

  1) First, we lock DP_TAB.NETWORK_ACTIVITY_DATA_RES in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access, we lock
     DP_TAB.NETWORK_ACTIVITY_DATA_BLC_2013 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access, we lock
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ4_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access, we lock
     DP_TAB.NETWORK_ACTIVITY_DATA_RES_BLC in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access, we lock
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ2_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access, we lock
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ1_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access, and we lock
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ3_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES for access.
  2) Next, we do an all-AMPs RETRIEVE step from 31 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_RES in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES with a condition of (
     "(DP_TAB.NETWORK_ACTIVITY_DATA_RES in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_RES in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '3015-02-09 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_RES in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2015-01-01 00:00:00'))") into Spool 1 (all_amps), which
     is built locally on the AMPs.  The input table will not be cached
     in memory, but it is eligible for synchronized scanning.  The size
     of Spool 1 is estimated with low confidence to be 1 row (70 bytes).
     The estimated time for this step is 37.22 seconds.
  3) We do an all-AMPs RETRIEVE step from 31 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_RES_BLC in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES with a condition of (
     "(DP_TAB.NETWORK_ACTIVITY_DATA_RES_BLC in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_RES_BLC in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2015-01-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_RES_BLC in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-10-13 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_RES_BLC in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '3015-02-10 00:00:00')))") into Spool 1 (all_amps),
     which is built locally on the AMPs.  The input table will not be
     cached in memory, but it is eligible for synchronized scanning.
     The result spool file will not be cached in memory.  The size of
     Spool 1 is estimated with low confidence to be 22,856,337,679 rows
     (1,599,943,637,530 bytes).  The estimated time for this step is 1
     hour and 52 minutes.
  4) We do an all-AMPs RETRIEVE step from 0 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ1_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES by way of an all-rows scan
     with a condition of ("(DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ1_14 in
     view dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ1_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm<
     TIMESTAMP '2015-01-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ1_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2014-04-01 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ1_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-01-01 00:00:00')))") into Spool 1 (all_amps),
     which is built locally on the AMPs.  The input table will not be
     cached in memory, but it is eligible for synchronized scanning.
     The size of Spool 1 is estimated with low confidence to be
     22,856,337,680 rows (1,599,943,637,600 bytes).  The estimated time
     for this step is 0.01 seconds.
  5) We do an all-AMPs RETRIEVE step from 0 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ2_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES by way of an all-rows scan
     with a condition of ("(DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ2_14 in
     view dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ2_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm<
     TIMESTAMP '2015-01-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ2_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2014-07-01 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ2_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-04-01 00:00:00')))") into Spool 1 (all_amps),
     which is built locally on the AMPs.  The input table will not be
     cached in memory, but it is eligible for synchronized scanning.
     The size of Spool 1 is estimated with low confidence to be
     22,856,337,681 rows (1,599,943,637,670 bytes).  The estimated time
     for this step is 0.01 seconds.
  6) We do an all-AMPs RETRIEVE step from 0 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ3_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES by way of an all-rows scan
     with a condition of ("(DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ3_14 in
     view dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ3_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm<
     TIMESTAMP '2014-10-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ3_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-07-01 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ3_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2015-01-01 00:00:00')))") into Spool 1 (all_amps),
     which is built locally on the AMPs.  The input table will not be
     cached in memory, but it is eligible for synchronized scanning.
     The size of Spool 1 is estimated with low confidence to be
     22,856,337,682 rows (1,599,943,637,740 bytes).  The estimated time
     for this step is 0.01 seconds.
  7) We do an all-AMPs RETRIEVE step from 0 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ4_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES by way of an all-rows scan
     with a condition of ("(DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ4_14 in
     view dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ4_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm<
     TIMESTAMP '2015-01-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ4_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2014-10-13 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_BLCQ4_14 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-10-01 00:00:00')))") into Spool 1 (all_amps),
     which is built locally on the AMPs.  The input table will not be
     cached in memory, but it is eligible for synchronized scanning.
     The size of Spool 1 is estimated with low confidence to be
     22,856,337,683 rows (1,599,943,637,810 bytes).  The estimated time
     for this step is 0.01 seconds.
  8) We do an all-AMPs RETRIEVE step from 0 partitions of
     DP_TAB.NETWORK_ACTIVITY_DATA_BLC_2013 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES by way of an all-rows scan
     with a condition of ("(DP_TAB.NETWORK_ACTIVITY_DATA_BLC_2013 in
     view dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm >=
     TIMESTAMP '2014-12-01 00:00:00') AND
     ((DP_TAB.NETWORK_ACTIVITY_DATA_BLC_2013 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm<
     TIMESTAMP '2014-01-01 00:00:00') AND
     (DP_TAB.NETWORK_ACTIVITY_DATA_BLC_2013 in view
     dp_tab_view.NETWORK_ACTIVITY_DATA_RES.Activity_Start_Dttm <
     TIMESTAMP '2015-01-01 00:00:00'))") into Spool 1 (all_amps), which
     is built locally on the AMPs.  The input table will not be cached
     in memory, but it is eligible for synchronized scanning.  The size
     of Spool 1 is estimated with low confidence to be 22,856,337,684
     rows (1,599,943,637,880 bytes).  The estimated time for this step
     is 0.01 seconds.
  9) We do an all-AMPs SUM step to aggregate from Spool 1 (Last Use) by
     way of an all-rows scan with a condition of (
     "((CAST((NETWORK_ACTIVITY_DATA_RES.ACTIVITY_START_DTTM) AS
     DATE))>= DATE '2014-12-01') AND
     ((CAST((NETWORK_ACTIVITY_DATA_RES.ACTIVITY_START_DTTM) AS DATE))<=
     DATE '2014-12-31')") , grouping by field1 ( ORIGINATING_NUMBER_VAL).
     Aggregate Intermediate Results are computed globally, then placed
     in Spool 4.  The aggregate spool file will not be cached in memory.
     The size of Spool 4 is estimated with low confidence to be
     17,142,253,263 rows (1,628,514,059,985 bytes).  The estimated time
     for this step is 6 hours and 28 minutes.
 10) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
     an all-rows scan into Spool 2 (group_amps), which is built locally
     on the AMPs.  The result spool file will not be cached in memory.
     The size of Spool 2 is estimated with low confidence to be
     17,142,253,263 rows (1,165,673,221,884 bytes).  The estimated time
     for this step is 21 minutes and 27 seconds.
 11) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 2 are sent back to the user as the result of
     statement 1.  The total estimated time is 8 hours and 42 minutes.

As @JNevill suggested, creating the target table as MULTISET is always a good idea, because it skips the duplicate-row check that a SET table performs for every inserted row. Beyond that there is not much you can do, since the plan looks reasonable.
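For reference, a minimal sketch of such a target table (the table name, column types and the NUPI choice are assumptions for illustration; only the MULTISET keyword reflects the actual suggestion):

CREATE MULTISET TABLE target_agg_table            -- hypothetical name
(
    ORIGINATING_NUMBER_VAL  VARCHAR(20),          -- assumed type
    ACTIVITY_DURATION_MEAS  DECIMAL(18,0),        -- assumed type
    Upload_Data_Volume      DECIMAL(18,0),        -- assumed type
    Download_Data_Volume    DECIMAL(18,0)         -- assumed type
)
PRIMARY INDEX (ORIGINATING_NUMBER_VAL);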

Since you appear to have daily partitions in the source table ("We do an all-AMPs RETRIEVE step from 31 partitions of"), you could run a series of smaller daily queries instead (see the sketch after this list). It will not be faster overall, but:

  • you get the results incrementally,
  • in case of a failure you do not have to start over after the query has already been running for hours,
  • you get a better ETA, because you will soon have the actual execution times of the first few days. The numbers in EXPLAIN can differ greatly from the actual times.
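A minimal sketch of one such daily slice, reusing the view and columns from the EXPLAIN above; the target table name is an assumption, and a plain half-open timestamp range is used instead of the CAST ... BETWEEN form (the plan above shows the optimizer already resolved that form to 31 daily partitions):

INSERT INTO target_agg_table                      -- hypothetical target table
SELECT
    ORIGINATING_NUMBER_VAL,
    SUM(ACTIVITY_DURATION_MEAS),
    SUM(Upload_Data_Volume),
    SUM(Download_Data_Volume)
FROM
    dp_tab_view.NETWORK_ACTIVITY_DATA_RES
WHERE
    Activity_Start_Dttm >= TIMESTAMP '2014-12-01 00:00:00'
AND Activity_Start_Dttm <  TIMESTAMP '2014-12-02 00:00:00'    -- one day per run
GROUP BY
    ORIGINATING_NUMBER_VAL;

Repeat with the date pair shifted one day at a time, for example from a BTEQ script or a stored procedure loop. If the same ORIGINATING_NUMBER_VAL occurs on several days, the target will then hold one row per number per day, so a final roll-up over those rows is still needed for totals across the whole range.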