Incremental append to file

Question

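I have a table in MySQL that I want to import with Sqoop. I imported the data and stored it in HDFS as a file. Now I want to run an incremental update against that file in HDFS.

Say the MySQL table has 100 rows and the file in HDFS contains the first 50 of them. How can I incrementally update this file?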

I am talking about files, not Hive tables.

I want the incremental data as a separate file, not a merged file. For example, if the first part file contains 50 records, I need another part file that contains the next 50 records. In other words, can we do incremental updates on files?

Answer 1

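You cannot update the HDFS file in this case.

But this is a common use case, and the sqoop-merge tool can handle it. You need to run a Sqoop incremental import and save the output in a different HDFS file.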
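As a rough sketch of the incremental import step, something like the following could work. The connection string, user, table name, target directory, and check column are all hypothetical placeholders; the flags themselves (`--incremental append`, `--check-column`, `--last-value`) are standard Sqoop import options. The script only prints the command unless `DRY_RUN=0` is set, so it can be inspected before anything touches the database.

```shell
#!/bin/sh
# All values below are assumptions for illustration; replace them with your own.
JDBC_URL="jdbc:mysql://dbhost/mydb"      # hypothetical MySQL host and database
DB_USER="dbuser"                         # hypothetical database user
TABLE="mytable"                          # hypothetical source table
TARGET_DIR="/user/hadoop/mytable_incr"   # separate HDFS directory for the new rows
CHECK_COLUMN="id"                        # assumed to increase with every new row
LAST_VALUE=50                            # highest id already imported into HDFS

# Build the incremental import command.
# Prints it by default; set DRY_RUN=0 to actually execute it.
incremental_import() {
  set -- sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --table "$TABLE" \
    --target-dir "$TARGET_DIR" \
    --incremental append \
    --check-column "$CHECK_COLUMN" \
    --last-value "$LAST_VALUE"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

incremental_import
```

In append mode Sqoop imports only the rows whose check column is greater than `--last-value`, so with the values above the new directory would hold rows 51 to 100 as their own part files. That directory can then be left as is, or passed as `--new-data` to the merge tool.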

According to the documentation:

The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Example command:

sqoop merge --new-data newer --onto older --target-dir merged \
--jar-file datatypes.jar --class-name Foo --merge-key id

Tags: hive, increment, hdfs, sqoop