Save sqoop incremental import id

Question

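I have a lot of sqoop jobs running on AWS EMR, but sometimes I need to shut the instance down.

Is there a way to save the last id from an incremental import, maybe locally, and then upload it to s3 via a cronjob?

My first idea is that, when I create the job, I just send a request to Redshift, where my data is stored, and fetch the last id or last_modified through a bash script.

Another idea is to take the output of sqoop job --show $jobid, filter out the last_id parameter, and use it to recreate the job.

But I don't know whether sqoop offers an easier way to do this.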

Answer 1

According to the Sqoop docs:

If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.

So you don't need to store anything yourself. Sqoop's metastore saves the last value and uses it for the next incremental import run of the job.

For example:

sqoop job \
--create new_job \
-- \
import \
--connect jdbc:mysql://localhost/testdb \
--username xxxx \
--password xxxx \
--table employee \
--incremental append \
--check-column id \
--last-value 0

Then start the job with the --exec argument:

sqoop job --exec new_job
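To check which value the saved job will use on its next run, the stored value can be read back from the output of sqoop job --show. The sketch below assumes that output contains a line of the form `incremental.last.value = <value>`, which is how Sqoop 1.4.x prints it as far as I know; verify the property name against your version. The function name `extract_last_value` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `sqoop job --show` output on stdin and prints
# the value of the incremental.last.value property, if present.
# Assumes the output uses lines of the form "key = value".
extract_last_value() {
  sed -n 's/^incremental\.last\.value[[:space:]]*=[[:space:]]*//p' | head -n 1
}

# Usage (needs a working Sqoop installation and the saved job from above):
#   sqoop job --show new_job | extract_last_value
```

This only reads the value. The saved job itself still keeps track of it, so nothing has to be passed back by hand.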

Answer 2

Solution

I changed the sqoop-site.xml file and added the endpoint of my MySQL instance to it.

Steps

Create a MySQL instance and run these two statements:
CREATE TABLE SQOOP_ROOT (version INT, propname VARCHAR(128) NOT NULL, propval VARCHAR(256), CONSTRAINT SQOOP_ROOT_unq UNIQUE (version, propname));
INSERT INTO SQOOP_ROOT VALUES(NULL, 'sqoop.hsqldb.job.storage.version', '0');
Edit the original sqoop-site.xml to add your MySQL endpoint, user, and password:

  <property>
    <name>sqoop.metastore.client.enable.autoconnect</name>
    <value>true</value>
    <description>If true, Sqoop will connect to a local metastore
      for job management when no other metastore arguments are
      provided.
    </description>
  </property>
  

  <!--
    The auto-connect metastore is stored in ~/.sqoop/. Uncomment
    these next arguments to control the auto-connect process with
    greater precision.
  -->
  
  <property>
    <name>sqoop.metastore.client.autoconnect.url</name>
    <value>jdbc:mysql://your-mysql-instance-endpoint:3306/database</value>
    <description>The connect string to use when connecting to a
      job-management metastore. If unspecified, uses ~/.sqoop/.
      You can specify a different path here.
    </description>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.username</name>
    <value>${sqoop-user}</value>
    <description>The username to bind to the metastore.
    </description>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.password</name>
    <value>${sqoop-pass}</value>
    <description>The password to bind to the metastore.
    </description>
  </property>

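The first time you run sqoop job --list, it returns no jobs. But once you have created a job, shutting down the EMR cluster no longer loses the sqoop metadata for the jobs you have run.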

On EMR, we can use a Bootstrap action to do this automatically when the cluster is created.
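As a rough starting point for such a script, the sketch below only renders the three autoconnect properties shown above from parameters. The function name `render_metastore_props` and the example values are hypothetical. How and when the fragment gets merged into sqoop-site.xml depends on the cluster setup and is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the three autoconnect properties for a given
# metastore URL, user and password. Values are inserted verbatim, so they
# must not contain XML special characters such as & or <.
render_metastore_props() {
  local url="$1" user="$2" pass="$3"
  local name value
  for name in url username password; do
    case "$name" in
      url)      value="$url" ;;
      username) value="$user" ;;
      password) value="$pass" ;;
    esac
    printf '  <property>\n'
    printf '    <name>sqoop.metastore.client.autoconnect.%s</name>\n' "$name"
    printf '    <value>%s</value>\n' "$value"
    printf '  </property>\n'
  done
}

# Example with placeholder values:
#   render_metastore_props 'jdbc:mysql://your-mysql-instance-endpoint:3306/database' 'sqoop_user' 'change-me'
```

The output is meant to go inside the <configuration> element of sqoop-site.xml, next to the sqoop.metastore.client.enable.autoconnect property.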
