Sqoop --split-by error while importing despite having a primary key in the table
MySQL table with dept_id as the primary key:
| dept_id | dept_name |
| 2 | Fitness |
| 3 | Footwear |
| 4 | Apparel |
| 5 | Golf |
| 6 | Outdoors |
| 7 | Fan Shop |
Sqoop command:
sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--target-dir /user/cloudera/sqoop_import/departments;
Error shown on the console:
When importing query results in parallel, you must specify --split-by
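---Question!---
The table has a primary key, and the rows can be split evenly between 2 mappers, so why is --split-by or -m 1 required?
Please guide me on this.
Thanks.
According to the Sqoop docs: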
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
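So you have to specify the primary key in the --split-by option.
If you choose 1 mapper, Sqoop does not split the work in parallel, and the whole import runs in that single mapper.
See my other answer (if needed) to understand why $CONDITIONS is required and how it relates to the number of mappers.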
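As a rough sketch of the point above, the question's command can be assembled in a bash array so that --split-by is added only when more than one mapper is requested. The variable names `mappers`, `split_col`, and `cmd` are made up for this illustration, and `dept_id` is assumed to be the primary key column, as in the question. The script only prints the command and does not run Sqoop.

```shell
#!/usr/bin/env bash
# Sketch only: build the import command from the question and print it.
mappers=2
split_col=dept_id   # assumption: the primary key column shown in the question

cmd=(sqoop import
  -m "$mappers"
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db"
  --username retail_dba
  -P
  --query 'select * from departments where dept_id < 6 AND $CONDITIONS'
  --target-dir /user/cloudera/sqoop_import/departments)

# A parallel free-form query import needs an explicit splitting column;
# with a single mapper there is nothing to split.
if [ "$mappers" -gt 1 ]; then
  cmd+=(--split-by "$split_col")
fi

# Print a shell-quoted version of the command for review.
printf '%q ' "${cmd[@]}"
echo
```

On a host that has Sqoop installed, the command could then be executed with `"${cmd[@]}"`.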
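The primary key is not what decides whether --split-by is needed. You see the error because you are using the --query option. For a parallel import (more than one mapper), this option must be used together with --split-by, --target-dir, and the $CONDITIONS token in the query.
From the free-form query imports documentation: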
When importing a free-form query, you must specify a destination
directory with --target-dir.
If you want to import the results of a query in parallel, then each
map task will need to execute a copy of the query, with results
partitioned by bounding conditions inferred by Sqoop. Your query must
include the token $CONDITIONS which each Sqoop process will replace
with a unique condition expression. You must also select a splitting
column with --split-by.
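If you don't want to use --split-by and --query, you can import with --table and the --where option. With --table, Sqoop uses the table's primary key as the splitting column by default: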
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--where "department_id < 6"
If you use the --boundary-query option with --table, you likewise don't need the --split-by and --query options:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--boundary-query "select 2, 6 from departments limit 1"
By default sqoop will use query select min(<split-by>),
max(<split-by>) from <table name>
to find out boundaries for creating
splits. In some cases this query is not the most optimal so you can
specify any arbitrary query returning two numeric columns using
--boundary-query argument.
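To picture how the boundaries and the number of mappers turn into per-mapper conditions, here is a small bash illustration. It is an approximation written for this thread, not Sqoop's actual splitter code, and the exact expressions Sqoop generates may differ. The values min=2 and max=5 are taken from the question's data with the filter dept_id < 6. The variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: divide the integer range [min, max] among the mappers
# and print one condition per mapper, similar in spirit to what replaces
# $CONDITIONS in each map task.
split_col=dept_id
min=2
max=5
mappers=2

span=$(( max - min + 1 ))   # number of integer values in [min, max]
conditions=()
lo=$min
for (( i = 0; i < mappers; i++ )); do
  # Earlier mappers take one extra value when span is not evenly divisible.
  size=$(( span / mappers + (i < span % mappers ? 1 : 0) ))
  hi=$(( lo + size - 1 ))
  conditions+=("$split_col >= $lo AND $split_col <= $hi")
  lo=$(( hi + 1 ))
done

printf '%s\n' "${conditions[@]}"
```

With these values the script prints two ranges, dept_id 2 to 3 and dept_id 4 to 5, one for each mapper.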
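The reason sqoop import needs --split-by when you use --query is that, when the data source is given as a query, Sqoop cannot know or guess the primary key. A query can join two or more tables with multiple keys and fields, so Sqoop has no way to tell which key it should split on.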