Sqoop --split-by error while importing despite having a primary key in the table

Question

The MySQL table has dept_id as its primary key:

| dept_id | dept_name |
|---------|-----------|
| 2       | Fitness   |
| 3       | Footwear  |
| 4       | Apparel   |
| 5       | Golf      |
| 6       | Outdoors  |
| 7       | Fan Shop  |

Sqoop command:

sqoop import \
  -m 2 \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  -P \
  --query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
  --target-dir /user/cloudera/sqoop_import/departments

Error shown on the console:

When importing query results in parallel, you must specify --split-by

---The question!---
Even though the table has a primary key and the rows can be split evenly between 2 mappers, why is --split-by or -m 1 required?

Please guide me on this.
Thanks.

Answer 1

According to the Sqoop docs:

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

So you have to name the primary key yourself with the --split-by option.

If you choose 1 mapper, Sqoop does not split the work into parallel tasks and performs the whole import in a single mapper.
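Applied to the command from the question, the two options might look roughly like this. It is a sketch rather than a tested invocation: it assumes the split column is dept_id, as shown in the question, and reuses the question's connection string and target directory unchanged.

```shell
# Option 1: keep two mappers and name the split column explicitly.
sqoop import \
  -m 2 \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  -P \
  --query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
  --split-by dept_id \
  --target-dir /user/cloudera/sqoop_import/departments

# Option 2: use a single mapper, so no split column is needed.
# Run this instead of option 1, not after it; both write to the same directory.
sqoop import \
  -m 1 \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  -P \
  --query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
  --target-dir /user/cloudera/sqoop_import/departments
```

The single quotes around the query matter: they stop the shell from expanding $CONDITIONS before Sqoop sees it.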

See my other answer (if needed) to understand why $CONDITIONS is required and how the number of mappers comes into play.

Answer 2

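The primary key is not what matters for --split-by here. You see the error because you are using the --query option. A --query import always needs --target-dir and a $CONDITIONS token in the query, and when it runs in parallel it also needs --split-by.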

From the free-form query imports documentation:

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
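To make the "bounding conditions" concrete, here is a simplified illustration of how a numeric range can be cut into one condition per mapper. It is not Sqoop's actual splitter code, and the function name split_conditions is made up for this example; it only shows the kind of expression that ends up replacing $CONDITIONS in each copy of the query.

```shell
# Simplified illustration only; NOT Sqoop's real splitting implementation.
# Usage: split_conditions COLUMN MIN MAX NUM_MAPPERS
# Prints one range condition per mapper, covering [MIN, MAX] without overlap.
split_conditions() (
  col=$1 min=$2 max=$3 n=$4
  span=$(( max - min + 1 ))          # how many integer values lie in [min, max]
  step=$(( (span + n - 1) / n ))     # ceiling division: values per mapper
  lo=$min
  while [ "$lo" -le "$max" ]; do
    hi=$(( lo + step ))
    if [ "$hi" -gt "$max" ]; then
      # The last range is closed at MAX so the maximum row is included.
      echo "$col >= $lo AND $col <= $max"
    else
      echo "$col >= $lo AND $col < $hi"
    fi
    lo=$hi
  done
)

# The question's query keeps dept_id 2..5 and asks for 2 mappers.
split_conditions dept_id 2 5 2
```

For the question's data this prints `dept_id >= 2 AND dept_id < 4` and `dept_id >= 4 AND dept_id <= 5`, one condition per mapper. Sqoop can only build ranges like these once it knows which column to take the minimum and maximum of, which is what --split-by tells it.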

If you do not want to use --split-by and --query, you can use the --where option with --table instead:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  -P \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --where "department_id < 6"

If you use the --boundary-query option with --table, you need neither --split-by nor --query:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  -P \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --boundary-query "select 2, 6 from departments limit 1"

From the "selecting the data to import" documentation:

By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.

Answer 3

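sqoop import requires --split-by when you use --query because, once the data source is given as a query, Sqoop cannot know or guess the primary key. A query can join two or more tables, each with its own keys and fields, so Sqoop has no way to tell which key it should split on.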
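As a hedged illustration of that point, consider a join query. The tables a and b, their columns, and the target directory below are hypothetical placeholders and not part of retail_db; the connection details are reused from the question.

```shell
# Hypothetical join: tables a and b and their columns are placeholders.
# Both tables may have keys, so the split column is named with a table qualifier.
sqoop import \
  -m 2 \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  -P \
  --query 'SELECT a.id, a.name, b.value FROM a JOIN b ON (a.id = b.a_id) WHERE $CONDITIONS' \
  --split-by a.id \
  --target-dir /user/cloudera/sqoop_import/joinresults
```

Here nothing in the query text tells Sqoop whether a.id, b.a_id, or some other column is the right one to range over, so the choice has to come from --split-by.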


hadoop

hdfs

sqoop

hadoop2

cloudera-quickstart-vm