Sqoop export HDFS to MySQL DB

I am trying to export data from HDFS to a MySQL database. I have found various solutions, but none of them worked; I even tried removing the WINDOWS-1251 characters from the file.

As a quick summary: I am using VirtualBox with the Hortonworks sandbox image for this.

My Hive table in the default database:

CREATE EXTERNAL TABLE `airqualitydata`(
  `sensor_id` VARCHAR(100),
  `sensor_type` VARCHAR(100), 
  `location` VARCHAR(100), 
  `lat` VARCHAR(100), 
  `lon` VARCHAR(100), 
  `timestamp` timestamp, 
  `p1` VARCHAR(100), 
  `durp1` VARCHAR(100), 
  `ratiop1` VARCHAR(100), 
  `p2` VARCHAR(100), 
  `durp2` VARCHAR(100), 
  `ratiop2` VARCHAR(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '3'
LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/hadoop/airqualitydata'
TBLPROPERTIES ("skip.header.line.count"="1");

The file contained in /hadoop/airqualitydata on HDFS (WINDOWS-1251 characters removed, just to be sure):

Note that this data can be viewed by querying SELECT * FROM airqualitydata in Hive.

sensor_id;sensor_type;location;lat;lon;timestamp;P1;durP1;ratioP1;P2;durP2;ratioP2
9710;SDS011;4894;43.226;27.934;2021-09-09T00:00:12;70;;;20;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:02:41;83;;;0.93;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:05:14;0.80;;;0.73;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:07:42;0.50;;;0.50;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:10:10;57;;;0.80;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:12:39;0.40;;;0.40;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:15:07;0.70;;;0.70;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:17:35;2;;;0.47;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:20:04;90;;;0.63;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:22:34;0.57;;;0.57;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:25:01;0.73;;;0.60;;

The MySQL database & table:

CREATE DATABASE airquality CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE TABLE `airqualitydata`(
  `sensor_id` VARCHAR(100), 
  `sensor_type` VARCHAR(100), 
  `location` VARCHAR(100), 
  `lat` VARCHAR(100), 
  `lon` VARCHAR(100), 
  `timestamp` timestamp, 
  `p1` VARCHAR(100), 
  `durp1` VARCHAR(100), 
  `ratiop1` VARCHAR(100), 
  `p2` VARCHAR(100), 
  `durp2` VARCHAR(100), 
  `ratiop2` VARCHAR(100)
);

The Sqoop CLI invocation:

sqoop export --connect "jdbc:mysql://localhost:3306/airquality?useUnicode=true&characterEncoding=WINDOWS-1251" --username root --password hortonworks1 --export-dir hdfs://sandbox-hdp.hortonworks.com:8020/hadoop/airqualitydata --table airqualitydata --input-fields-terminated-by "3" --input-lines-terminated-by "\n" -m 1

I removed ?useUnicode=true&characterEncoding=WINDOWS-1251 as well, with no success. I also cannot access the logs from the URL given in the terminal, so this is all I get for the failure:

Warning: /usr/hdp/2.6.5.0-292/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/09/12 04:04:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.6.5.0-292
21/09/12 04:04:40 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/09/12 04:04:40 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/09/12 04:04:40 INFO tool.CodeGenTool: Beginning code generation
Sun Sep 12 04:04:40 UTC 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
21/09/12 04:04:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `airqualitydata` AS t LIMIT 1
21/09/12 04:04:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `airqualitydata` AS t LIMIT 1
21/09/12 04:04:40 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.6.5.0-292/hadoop-mapreduce
Note: /tmp/sqoop-raj_ops/compile/41fba9933b913b974b70403656a13287/airqualitydata.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/09/12 04:04:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-raj_ops/compile/41fba9933b913b974b70403656a13287/airqualitydata.jar
21/09/12 04:04:42 INFO mapreduce.ExportJobBase: Beginning export of airqualitydata
21/09/12 04:04:43 INFO client.RMProxy: Connecting to ResourceManager at sandbox-hdp.hortonworks.com/172.18.0.2:8032
21/09/12 04:04:43 INFO client.AHSProxy: Connecting to Application History server at sandbox-hdp.hortonworks.com/172.18.0.2:10200
21/09/12 04:04:50 INFO input.FileInputFormat: Total input paths to process : 1
21/09/12 04:04:50 INFO input.FileInputFormat: Total input paths to process : 1
21/09/12 04:04:50 INFO mapreduce.JobSubmitter: number of splits:1
21/09/12 04:04:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1631399426919_0028
21/09/12 04:04:51 INFO impl.YarnClientImpl: Submitted application application_1631399426919_0028
21/09/12 04:04:51 INFO mapreduce.Job: The url to track the job: http://sandbox-hdp.hortonworks.com:8088/proxy/application_1631399426919_0028/
21/09/12 04:04:51 INFO mapreduce.Job: Running job: job_1631399426919_0028
21/09/12 04:04:59 INFO mapreduce.Job: Job job_1631399426919_0028 running in uber mode : false
21/09/12 04:04:59 INFO mapreduce.Job:  map 0% reduce 0%
21/09/12 04:05:03 INFO mapreduce.Job:  map 100% reduce 0%
21/09/12 04:05:04 INFO mapreduce.Job: Job job_1631399426919_0028 failed with state FAILED due to: Task failed task_1631399426919_0028_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

21/09/12 04:05:04 INFO mapreduce.Job: Counters: 8
        Job Counters
                Failed map tasks=1
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2840
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=2840
                Total vcore-milliseconds taken by all map tasks=2840
                Total megabyte-milliseconds taken by all map tasks=710000
21/09/12 04:05:04 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
21/09/12 04:05:04 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 21.2627 seconds (0 bytes/sec)
21/09/12 04:05:04 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
21/09/12 04:05:04 INFO mapreduce.ExportJobBase: Exported 0 records.
21/09/12 04:05:04 ERROR mapreduce.ExportJobBase: Export job failed!
21/09/12 04:05:04 ERROR tool.ExportTool: Error during export: Export job failed!
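
Side note for anyone else stuck here: if the tracking URL is not reachable from the host browser, the aggregated task logs can usually be pulled from inside the sandbox with the YARN CLI instead (assuming log aggregation is enabled):

yarn logs -applicationId application_1631399426919_0028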

Any pointers would be helpful, thanks!

EDIT #1: Based on the comments above, using:

sqoop export --connect jdbc:mysql://localhost:3306/airquality  --table airqualitydata  --username root --password hortonworks1 --hcatalog-database default --hcatalog-table airqualitydata --verbose

or, in generic form (for anyone reproducing this):

sqoop export --connect jdbc:mysql://<host:port>/<mysql db> --table <mysql table> --username <mysql_user> --password <mysqlpass> --hcatalog-database <hive_db> --hcatalog-table <hive_table> --verbose

this got the data into MySQL. However, it also inserted the header row. In addition, when the job is run twice (I believed it would overwrite the data), the data ends up in the table twice.

+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+
| sensor_id | sensor_type | location | lat    | lon    | timestamp           | p1   | durp1 | ratiop1 | p2   | durp2 | ratiop2 |
+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+
| sensor_id | sensor_type | location | lat    | lon    | 2021-09-12 05:55:49 | P1   | durP1 | ratioP1 | P2   | durP2 | ratioP2 |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 70   |       |         | 20   |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 83   |       |         | 0.93 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.80 |       |         | 0.73 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.50 |       |         | 0.50 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 57   |       |         | 0.80 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.40 |       |         | 0.40 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.70 |       |         | 0.70 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 2    |       |         | 0.47 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 90   |       |         | 0.63 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.57 |       |         | 0.57 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.73 |       |         | 0.60 |       |         |
| sensor_id | sensor_type | location | lat    | lon    | 2021-09-12 05:58:02 | P1   | durP1 | ratioP1 | P2   | durP2 | ratioP2 |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 70   |       |         | 20   |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 83   |       |         | 0.93 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.80 |       |         | 0.73 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.50 |       |         | 0.50 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 57   |       |         | 0.80 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.40 |       |         | 0.40 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.70 |       |         | 0.70 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 2    |       |         | 0.47 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 90   |       |         | 0.63 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.57 |       |         | 0.57 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.73 |       |         | 0.60 |       |         |
+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+

The data in Hive is fine (no header row there). What could be causing this?

I also get an exception, although the job completes overall; does this matter?

21/09/12 05:57:41 INFO mapreduce.Job: Running job: job_1631399426919_0035
21/09/12 05:57:55 INFO mapreduce.Job: Job job_1631399426919_0035 running in uber mode : false
21/09/12 05:57:55 INFO mapreduce.Job:  map 0% reduce 0%
21/09/12 05:58:03 INFO mapreduce.Job:  map 100% reduce 0%
21/09/12 05:58:05 INFO mapreduce.Job: Job job_1631399426919_0035 completed successfully
21/09/12 05:58:06 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=345759
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2597
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=4966
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=4966
                Total vcore-milliseconds taken by all map tasks=4966
                Total megabyte-milliseconds taken by all map tasks=1241500
        Map-Reduce Framework
                Map input records=12
                Map output records=12
                Input split bytes=1800
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=211
                CPU time spent (ms)=3490
                Physical memory (bytes) snapshot=217477120
                Virtual memory (bytes) snapshot=1972985856
                Total committed heap usage (bytes)=51380224
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Transferred 2.5361 KB in 62.3328 seconds (41.6635 bytes/sec)
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Exported 12 records.
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Publishing HCatalog export job data to Listeners
21/09/12 05:58:06 WARN mapreduce.PublishJobData: Unable to publish export data to publisher org.apache.atlas.sqoop.hook.SqoopHook
java.lang.ClassNotFoundException: org.apache.atlas.sqoop.hook.SqoopHook
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.sqoop.mapreduce.PublishJobData.publishJobData(PublishJobData.java:46)
        at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:457)
        at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
        at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81)
        at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
21/09/12 05:58:06 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@4232c52b

The solution to the first problem: use --hcatalog-database mydb --hcatalog-table airquality and remove the --export-dir argument.

Sqoop export cannot replace data. Issue a sqoop eval statement to truncate the main table before loading it:

sqoop eval --connect conn_parameters --username xx --password yy --query "truncate table mytab;"
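
Applied to the setup in the question, an untested sketch of that wipe-and-reload sequence would be:

sqoop eval --connect jdbc:mysql://localhost:3306/airquality --username root --password hortonworks1 \
  --query "TRUNCATE TABLE airqualitydata"
sqoop export --connect jdbc:mysql://localhost:3306/airquality --username root --password hortonworks1 \
  --table airqualitydata --hcatalog-database default --hcatalog-table airqualitydata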

You can also use update statements to update the table: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
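
As an untested sketch of what an update-style export could look like here (it assumes you first add a primary or unique key covering sensor_id and timestamp in MySQL, and that this Sqoop version accepts update mode together with --hcatalog-table; I have not verified the latter):

sqoop export --connect jdbc:mysql://localhost:3306/airquality --username root --password hortonworks1 \
  --table airqualitydata --hcatalog-database default --hcatalog-table airqualitydata \
  --update-key sensor_id,timestamp --update-mode allowinsert

With --update-mode allowinsert, rows matching the key are updated and the rest are inserted, which would also avoid the duplicated rows when the job is run twice.
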
Now, for your header problem: I think the source table may contain a header row. I am not sure about the data in the source table; check that the source table is defined correctly in Hive.
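
If the header line is indeed the culprit (skip.header.line.count is applied when Hive itself reads the files, and the HCatalog reader that Sqoop uses may not honor it), one workaround sketch is to stage the rows into a managed table that filters the header out, then export that table instead; airqualitydata_clean is a made-up name:

-- The header row carries the literal string 'sensor_id' in the sensor_id
-- column, so this guard drops it even if the reader does not skip it.
CREATE TABLE airqualitydata_clean AS
SELECT * FROM airqualitydata
WHERE sensor_id != 'sensor_id';

Then point --hcatalog-table at airqualitydata_clean for the export.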