Sqoop import: composite primary key and textual primary key
Stack: HDP-2.3.2.0-2950 installed with Ambari 2.1
The source database schema is on SQL Server and contains several tables whose primary key is either:
- a single varchar, or
- composite: two varchar columns, or one varchar + one int column, or two int columns. There is also one big table with three columns in the PK: one int + two varchar columns.
According to the Sqoop documentation:
Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
The first question is: what does 'manually choose a splitting column' expect of me - how do I sacrifice the PK and use just one of its columns, or am I missing some concept?
The SQL Server table is (only two columns, which together form the composite primary key):
ChassiNo varchar(8) Unchecked
ECU_Name nvarchar(15) Unchecked
I proceeded with the import; the source table has 7909097 records:
sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|' --table ChassiECU --num-mappers 8 --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose
Worrying warnings, and incorrect mapper input and record counts:
16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1168400
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1128
HDFS: Number of bytes written=209961941
HDFS: Number of read operations=32
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Job Counters
Launched map tasks=8
Other local map tasks=8
Total time spent by all maps in occupied slots (ms)=62785
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=62785
Total vcore-seconds taken by all map tasks=62785
Total megabyte-seconds taken by all map tasks=128583680
Map-Reduce Framework
Map input records=15818167
Map output records=15818167
Input split bytes=1128
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=780
CPU time spent (ms)=45280
Physical memory (bytes) snapshot=2219433984
Virtual memory (bytes) snapshot=20014182400
Total committed heap usage (bytes)=9394716672
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=209961941
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.
The Hive table that was created:
CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8),
`ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';
Bad result (no errors). The problem: 15818167 vs. 7909097 (SQL Server) records:
> select count(1) from ChassiECU;
Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 14 14 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.12 s
--------------------------------------------------------------------------------
OK
_c0
15818167
Surprisingly, if the composite key includes an int (used for splitting), I get either an exact count or a mismatch of fewer than 10 records, but even those still worry me!
How should I proceed?
Specify the split column manually. The split column does not have to equal the PK: you can have a complex PK and a separate int split column. You can specify any integer column, or even a simple function (a simple function such as substring or cast, not an aggregate or analytic function). The split column should ideally be an evenly distributed integer.
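For example, a minimal sketch of passing the split column explicitly (SomeTable and RecordId are hypothetical names; RecordId stands for any evenly distributed integer column):

# Sketch only: SomeTable and RecordId are hypothetical; RecordId is assumed to be an evenly distributed int column
sqoop import \
  --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --table SomeTable \
  --split-by RecordId \
  --num-mappers 8 \
  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016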
For example, if your split column contains a few rows with the value -1 and 10M rows with values from 10000 to 10000000, and num-mappers=8, sqoop will split the dataset across the mappers unevenly:
- the 1st mapper will get the few rows with -1,
- the 2nd-7th mappers will get 0 rows,
- the 8th mapper will get almost 10M rows,
which will cause data skew, and the 8th mapper will run forever or even fail. I have also gotten duplicates when using a non-integer split column with MS-SQL. So, use an integer split column. In your case, with a table of only two varchar columns, you can either:
(1) add a surrogate int PK and use it as the split column, or
(2) split your data manually, using a custom query with a WHERE clause and running sqoop several times with num-mappers=1, or
(3) apply some deterministic, integer, non-aggregate function to your varchar column, for example cast(substr(...) as int), or second(timestamp_col) or datepart(second, date), etc., as the split column (see the sketch after this list).
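A hedged sketch of option (3) against the ChassiECU table, assuming the trailing characters of ChassiNo are numeric and that your Sqoop version accepts an expression in --split-by (with --query, the $CONDITIONS token and --target-dir are required):

# Sketch only: assumes characters 5-8 of ChassiNo are digits; adjust the expression to your data
sqoop import \
  --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
  --username someusname --password somepass \
  --query 'SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE $CONDITIONS' \
  --split-by 'CAST(SUBSTRING(ChassiNo, 5, 4) AS INT)' \
  --num-mappers 8 \
  --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU

If the connector rejects an expression in --split-by, add the expression to the SELECT as a derived column and split on that column instead.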
For Teradata you can use the AMP number: HASHAMP (HASHBUCKET (HASHROW (string_column_list)))
gives you an integer AMP number from a list of non-integer key columns, and you rely on Teradata's distribution across AMPs. I used the simple function directly as the split without adding it to the query as a derived column.
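A similar hedged sketch for Teradata (the JDBC URL is a placeholder, the table and columns are reused from above for illustration, and it assumes the connector in use accepts the HASHAMP expression as a split column):

# Sketch only: placeholder connection; HASHAMP(...) returns an integer AMP number per row
sqoop import \
  --connect jdbc:teradata://sometdserver/DATABASE=somedb \
  --username someusname --password somepass \
  --query 'SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE $CONDITIONS' \
  --split-by 'HASHAMP(HASHBUCKET(HASHROW(ChassiNo, ECU_Name)))' \
  --num-mappers 8 \
  --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU_td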