In a Sqoop file import, I would like to control how the imported data is divided into file splits across the defined mappers
MySQL -> select * from employee
| empno | empname | salary |
======================================================
| 101 | Ram | 5000 |
| 102 | Hari | 7000 |
| 104 | Vamshi | 7000 |
| 103 | Revathy | 7000 |
| 105 | Jaya | 9000 |
| 106 | Suresh | 8000 |
| 107 | Ramesh | 9000 |
| 108 | Prasana | 10000 |
| 109 | Ramsamy | 20000 |
| 110 | Singaram | 30000 |
| 200 | ramanathan | 30000 |
| 201 | Victor | 33000 |
| 202 | Naveen | 33000 |
| 203 | Karthik | 33000 |
| 204 | Karthikeyan | 33000 |
| 205 | Somasundaram | 43000 |
| 301 | Test1 | 50000 |
| 302 | Test2 | 60000 |
| 303 | Test3 | 70000 |
Command in Sqoop:
sqoop import --connect jdbc:mysql://<hostname>/test --username <username> --password <password> --table employee \
  --direct --verbose \
  --split-by salary
With the above command, Sqoop queries min(salary) and max(salary) and writes the table to HDFS with 10 records in the first file, 3 records in the second file, 3 records in the third file and 3 records in the last file.
15/07/03 17:32:37 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`salary`), MAX(`salary`) FROM employee
15/07/03 17:32:37 DEBUG db.IntegerSplitter: Splits: [5,000 to 70,000] into 4 parts
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 5,000
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 21,250
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 37,500
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 53,750
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 70,000
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 5000' and upper bound '`salary` < 21250'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 21250' and upper bound '`salary` < 37500'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 37500' and upper bound '`salary` < 53750'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 53750' and upper bound '`salary` <= 70000'
15/07/03 17:32:37 INFO mapreduce.JobSubmitter: number of splits:4
I would like to know how it decides the number of records that go into each file. Can this be customized?
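The salary range is 5000 - 70000 (i.e. min 5000, max 70000). This range is divided into 4 salary ranges of equal width, one per split:
(70000 - 5000) / 4 = 16250
Therefore,
split 1 : from 5000 to 21250 (= 5000 + 16250)
split 2 : from 21250 to 37500 (= 21250 + 16250)
split 3 : from 37500 to 53750 (= 37500 + 16250)
split 4 : from 53750 to 70000 (= 53750 + 16250)
Each file holds the rows whose salary falls inside its range, so the record counts follow the distribution of the salary values and are not balanced by row count.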
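The range arithmetic above can be sketched in a few lines of Python. This is only a model of the calculation shown in the log, not Sqoop's actual IntegerSplitter code; the helper names `split_points` and `rows_per_split` are made up for illustration, and the salary list is copied from the table in the question.

```python
def split_points(lo, hi, num_splits):
    """Return num_splits + 1 evenly spaced boundaries from lo to hi (hypothetical helper)."""
    step = (hi - lo) / num_splits
    return [lo + i * step for i in range(num_splits)] + [hi]


def rows_per_split(values, points):
    """Count values per range: [low, high) for every range, [low, high] for the last one."""
    counts = [0] * (len(points) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            low, high = points[i], points[i + 1]
            # The last range also includes its upper bound (salary <= 70000 in the log).
            if low <= v < high or (i == last and v == high):
                counts[i] += 1
                break
    return counts


# Salary column of the employee table as listed in the question.
salaries = [5000, 7000, 7000, 7000, 9000, 8000, 9000, 10000, 20000,
            30000, 30000, 33000, 33000, 33000, 33000,
            43000, 50000, 60000, 70000]

points = split_points(min(salaries), max(salaries), 4)
counts = rows_per_split(salaries, points)

for i, count in enumerate(counts):
    print(f"split {i + 1}: {points[i]:.0f} to {points[i + 1]:.0f} -> {count} rows")
```

For the 19 rows as listed, this bucketing gives 9, 6, 2 and 2 rows, because the ranges are equal in width rather than in row count (the per-file counts reported in the question differ; the sketch models only the bounds printed in the log). Passing a different number of splits, which in Sqoop corresponds to the number of mappers set with -m / --num-mappers, or choosing a more evenly distributed --split-by column such as empno, changes these ranges and therefore the counts.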