Sqoop --split-by error while importing despite having a primary key in the table

MySQL table with dept_id as the primary key:

| dept_id | dept_name |
| 2       | Fitness   |
| 3       | Footwear  |
| 4       | Apparel   |
| 5       | Golf      |
| 6       | Outdoors  |
| 7       | Fan Shop  |

Sqoop command:

sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--target-dir /user/cloudera/sqoop_import/departments;

Error output on the console:

When importing query results in parallel, you must specify --split-by

---Question!---
Even though the table has a primary key and the rows can be split evenly between 2 mappers, why is --split-by or -m 1 required?

Please guide me on this.
Thanks.

According to the Sqoop docs:

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

So you have to specify the primary key in the --split-by option.

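If you choose 1 mapper, Sqoop will not split the work in parallel and will do the complete import in that 1 mapper, so --split-by is not needed.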
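
For example, something along these lines should get past the error. This is only a sketch: it assumes the split column really is named dept_id as in the question's table, and the wrapper function name import_departments is made up here so the command can be defined without running straight away.

```shell
# Sketch only: the question's free-form query plus an explicit split column.
# Assumption: dept_id is the actual column name; adjust to your schema.
# import_departments is a hypothetical wrapper name, not part of Sqoop.
import_departments() {
  sqoop import \
    -m 2 \
    --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username retail_dba \
    -P \
    --query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
    --split-by dept_id \
    --target-dir /user/cloudera/sqoop_import/departments
}
```

Call import_departments on a machine where Sqoop is set up. The single quotes around the query keep the shell from expanding $CONDITIONS before Sqoop sees it. If you stay with a single mapper (-m 1), the --split-by line can be dropped.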

Check my other answer (if needed) to understand why $CONDITIONS is required and how the number of mappers comes into play.

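This is not about the primary key and --split-by. You are seeing the error because you are using the --query option. That option must be used together with --split-by, --target-dir, and $CONDITIONS in the query.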

free_form_query_imports documentation:

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

If you do not want to use --split-by and --query, you can use the --where option:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  -P \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --where "department_id < 6"

If you use the --boundary-query option, you do not need the --split-by and --query options:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  -P \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --boundary-query "select 2, 6 from departments limit 1" 

selecting_the_data_to_import

By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.
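
To make the boundary idea a bit more concrete, here is a rough sketch of how a min/max range could be carved into one condition per mapper. It is a simplified illustration, not Sqoop's actual splitter code: the function name print_splits and the variables are made up, and the values (min 2, max 5, 2 mappers) are assumed from the question's `dept_id < 6` filter.

```shell
# Simplified illustration of range splitting; NOT Sqoop's real implementation.
# Assumed inputs: split column name, its min/max values, and the mapper count.
split_col="dept_id"
min=2
max=5
mappers=2

# Print one WHERE fragment per mapper, similar in spirit to what ends up
# in place of $CONDITIONS for each map task.
print_splits() {
  size=$(( (max - min + 1) / mappers ))
  lo=$min
  i=1
  while [ "$i" -le "$mappers" ]; do
    if [ "$i" -eq "$mappers" ]; then
      # The last mapper takes everything up to and including max.
      echo "$split_col >= $lo AND $split_col <= $max"
    else
      hi=$(( lo + size ))
      echo "$split_col >= $lo AND $split_col < $hi"
      lo=$hi
    fi
    i=$(( i + 1 ))
  done
}

print_splits
```

Each printed line covers a disjoint slice of the key range, which is why Sqoop needs to know which column to compute the boundaries on.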

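The reason sqoop import needs --split-by when you use --query is that, when the data source is given as a "query", Sqoop has no way to know or guess the primary key. In a query you can join two or more tables that have multiple keys and fields, so Sqoop cannot know or guess which key it should split on.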
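
A join makes this easy to see. In the sketch below, the tables a and b, their id columns, the target directory, and the wrapper name import_join_results are all hypothetical; only the shape of the command matters.

```shell
# Sketch only: a free-form join query where no single primary key applies,
# so the split column has to be named explicitly.
# Hypothetical names: tables a and b, column id, import_join_results.
import_join_results() {
  sqoop import \
    -m 2 \
    --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username retail_dba \
    -P \
    --query 'select a.*, b.* from a join b on (a.id = b.id) where $CONDITIONS' \
    --split-by a.id \
    --target-dir /user/cloudera/sqoop_import/join_results
}
```

Both a.id and b.id are candidate keys here, so it is up to you to tell Sqoop which one to split on.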