NoSuchMethodError when trying to run Gobblin on Dataproc

I am trying to run Gobblin on Google Dataproc, but I keep hitting this NoSuchMethodError and do not know how to resolve it.

Waiting for job output...
...
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        ...
Caused by: java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder()Lorg/apache/commons/cli/Option$Builder;
        at gobblin.runtime.cli.CliOption
        ...

The same job (shown below) runs fine on my local Hadoop setup on my laptop, but not on Dataproc. Has anyone tried running Gobblin on Dataproc?

Here is my Gobblin job file:

job.name=kafka2gcs
job.group=gkafka2gcs
job.description=Gobblin job to read messages from Kafka and save as is on GCS
job.lock.enabled=false

kafka.brokers=mykafka:9092
topic.whitelist=mytopic
bootstrap.with.offset=earliest

source.class=gobblin.source.extractor.extract.kafka.KafkaDeserializerSource
kafka.deserializer.type=BYTE_ARRAY
extract.namespace=nskafka2gcs

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
mr.job.max.mappers=2
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher
metrics.enabled=false

fs.uri=file:///.
writer.fs.uri=${fs.uri}
mr.job.root.dir=gobblin
writer.output.dir=${mr.job.root.dir}/out
writer.staging.dir=${mr.job.root.dir}/stg

fs.gs.project.id=my-test-project
data.publisher.fs.uri=gs://my-bucket
state.store.fs.uri=${data.publisher.fs.uri}
data.publisher.final.dir=gobblin/pub
state.store.dir=gobblin/state

These are the commands I issued for Dataproc:

gcloud dataproc clusters create myspark \
  --image-version 1.1 \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 10 \
  --num-workers 2 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 10 
gcloud dataproc jobs submit hadoop --cluster=myspark \
  --class gobblin.runtime.mapreduce.CliMRJobLauncher \
  --jars /opt/gobblin-dist/lib/gobblin-runtime-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-api-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-avro-json-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-codecs-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-core-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-core-base-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-crypto-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-crypto-provider-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-data-management-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metastore-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metrics-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metrics-base-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-metadata-0.10.0.jar,/opt/gobblin-dist/lib/gobblin-utility-0.10.0.jar,/opt/gobblin-dist/lib/avro-1.8.1.jar,/opt/gobblin-dist/lib/avro-mapred-1.8.1.jar,/opt/gobblin-dist/lib/commons-lang3-3.4.jar,/opt/gobblin-dist/lib/config-1.2.1.jar,/opt/gobblin-dist/lib/data-2.6.0.jar,/opt/gobblin-dist/lib/gson-2.6.2.jar,/opt/gobblin-dist/lib/guava-15.0.jar,/opt/gobblin-dist/lib/guava-retrying-2.0.0.jar,/opt/gobblin-dist/lib/joda-time-2.9.3.jar,/opt/gobblin-dist/lib/javassist-3.18.2-GA.jar,/opt/gobblin-dist/lib/kafka_2.11-0.8.2.2.jar,/opt/gobblin-dist/lib/kafka-clients-0.8.2.2.jar,/opt/gobblin-dist/lib/metrics-core-2.2.0.jar,/opt/gobblin-dist/lib/metrics-core-3.1.0.jar,/opt/gobblin-dist/lib/metrics-graphite-3.1.0.jar,/opt/gobblin-dist/lib/scala-library-2.11.8.jar,/opt/gobblin-dist/lib/influxdb-java-2.1.jar,/opt/gobblin-dist/lib/okhttp-2.4.0.jar,/opt/gobblin-dist/lib/okio-1.4.0.jar,/opt/gobblin-dist/lib/retrofit-1.9.0.jar,/opt/gobblin-dist/lib/reflections-0.9.10.jar \
  --properties mapreduce.job.user.classpath.first=true \
  -- -jobconfig gs://my-bucket/gobblin-kafka-gcs.job

I have also tried copying all of the Gobblin lib jars into /usr/lib/hadoop/lib on every machine in the Dataproc cluster, but that did not help either.

Any ideas?

gobblin 0.10.0
hadoop 2.7.3
dataproc image 1.1

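The Hadoop distribution is probably leaking its own version of commons-cli onto your classpath, where it conflicts with the version Gobblin was compiled against. Gobblin appears to depend on commons-cli 1.3.1, while Hadoop 2.7.3 ships 1.2. Option.builder() was only introduced in commons-cli 1.3, so the call fails with NoSuchMethodError whenever the 1.2 jar is loaded first.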
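To confirm which jar actually wins on a given classpath, a small reflection check can help. The following is only a minimal sketch: the class name ClassOrigin and its two helper methods are hypothetical names chosen for illustration, not part of Gobblin or Hadoop. It prints where a class was loaded from and whether it exposes a public no-argument method such as builder().

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Gobblin or Hadoop).
final class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a marker if the JVM reports none. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader typically have no CodeSource.
        return src == null ? "<no code source (JDK class?)>" : String.valueOf(src.getLocation());
    }

    /** Returns true if the class exposes a public method with this name and no parameters. */
    public static boolean hasNoArgMethod(Class<?> cls, String name) {
        try {
            cls.getMethod(name);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults match the class and method named in the stack trace above.
        String className = args.length > 0 ? args[0] : "org.apache.commons.cli.Option";
        String methodName = args.length > 1 ? args[1] : "builder";
        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        System.out.println(methodName + "() present: " + hasNoArgMethod(cls, methodName));
    }
}
```

Run with the same classpath the job driver sees, it should report the commons-cli jar that is picked up first. If that turns out to be the 1.2 jar under /usr/lib/hadoop/lib and builder() is reported as absent, the version conflict described above is the likely cause.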

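Normally, if these dependencies came from your own application, you would use something like the Maven shade plugin to bundle and relocate them. If you build Gobblin from source, you can check whether it compiles against commons-cli 1.2, or whether 1.3.1 is actually a hard dependency.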

If commons-cli 1.3.1 is fully backward compatible, you could try removing /usr/lib/hadoop/lib/commons-cli-1.2.jar on your cluster and adding your own downloaded commons-cli-1.3.1.jar in its place.