How to import Parquet data from S3 into HDFS using Sqoop?

Question

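I am trying to import data into a table in RDS. The data is in Parquet file format and lives in S3. My plan is to use Sqoop to import the data from S3 into HDFS, and then use Sqoop to export it to the RDS table. I was able to find the command for exporting data from HDFS to RDS, but I could not find a way to import Parquet data from S3. Could you help me construct the sqoop import command for this case?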

Answer 1

You can use Spark to copy the data from S3 to HDFS.

Read this blog for more details.
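
A minimal sketch of what such a Spark copy could look like in PySpark. The function name and the example paths are placeholders, not something taken from the question, and the sketch assumes the cluster already has the S3A connector and S3 credentials configured:

```python
def copy_parquet_s3_to_hdfs(spark, s3_path, hdfs_path):
    """Read Parquet files from S3 and write them back out as Parquet on HDFS.

    `spark` is an existing SparkSession (for example the one the pyspark
    shell provides). Both paths are placeholders supplied by the caller.
    """
    # Read every Parquet file under the S3 path into a DataFrame.
    df = spark.read.parquet(s3_path)
    # Write the same rows to HDFS, replacing any previous output.
    df.write.mode("overwrite").parquet(hdfs_path)
    return df

# Hypothetical usage inside a pyspark session:
# copy_parquet_s3_to_hdfs(spark, "s3a://<bucket_name>/<prefix>/", "hdfs:///user/<user>/parquet_data")
```

The write keeps the data in Parquet format; if the next step is a Sqoop export from delimited text, writing with `df.write.csv(...)` instead may be the more convenient choice.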

Answer 2

I think the simplest approach, and the one that worked best for me, is the following:

Create a Parquet table in Hive and load it with the Parquet data from S3:

create external table if not exists parquet_table(<column name> <column's datatype>) stored as parquet;

LOAD DATA INPATH 's3a://<bucket_name>/<parquet_file>' INTO table parquet_table;

Create a CSV table in Hive and load it with the data from the Parquet table:

create external table if not exists csv_table(<column name> <column's datatype>)
row format delimited fields terminated by ','
stored as textfile
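location 'hdfs:///user/hive/warehouse/csvdata';

-- populate the CSV table from the Parquet table (assumes both tables declare the same columns in the same order)
insert overwrite table csv_table select * from parquet_table;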

Now that we have a CSV/text-file table in Hive, Sqoop can easily export it from HDFS to the MySQL table in RDS:

sqoop export --table <mysql_table_name> --export-dir hdfs:///user/hive/warehouse/csvdata --connect jdbc:mysql://<host>:3306/<db_name> --username <username> --password-file hdfs:///user/test/mysql.password --batch -m 1 --input-null-string "\N" --input-null-non-string "\N" --columns <column names to be exported, without whitespace in between the column names>
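
If the export is launched from a script, the same arguments can be assembled programmatically, which also takes care of the "no whitespace between column names" requirement. This is only a sketch: the function names are hypothetical, and it assumes the `sqoop` executable is on the PATH of the machine running it.

```python
import subprocess

def build_sqoop_export_args(table, export_dir, jdbc_url, username, password_file, columns):
    """Assemble the argument list for the sqoop export command shown above."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
        "--batch", "-m", "1",
        # Hive writes NULL as \N in text tables; tell Sqoop to read it back as NULL.
        "--input-null-string", r"\N",
        "--input-null-non-string", r"\N",
        # Comma-separated column list with surrounding whitespace stripped.
        "--columns", ",".join(c.strip() for c in columns),
    ]

def run_sqoop_export(args):
    """Run the export and raise CalledProcessError if sqoop exits non-zero."""
    subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting issues around the `\N` null markers.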

Tags: hadoop, amazon-s3, hdfs, sqoop, parquet