Submitting Pig job from Google Cloud Dataproc does not add custom jars to Pig classpath

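I'm trying to submit a Pig job through Google Cloud Dataproc and include a custom jar that implements a custom load function used in my Pig script, but I can't figure out how to do that.

Adding my custom jar through the UI does not add it to the Pig classpath.

Here is the output of the Pig job, showing that it can't find my class: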

17/03/29 16:12:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/03/29 16:12:21 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
17/03/29 16:12:21 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2017-03-29 16:12:21,961 [main] INFO  org.apache.pig.Main - Apache Pig version 0.16.0 (r: unknown) compiled Nov 27 2016, 23:14:51
2017-03-29 16:12:21,961 [main] INFO  org.apache.pig.Main - Logging error messages to: /tmp/cb3b0696-3f30-4db4-a6a7-bb716d2a8a89/pig_1490803941959.log
2017-03-29 16:12:22,379 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2017-03-29 16:12:22,379 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-03-29 16:12:22,379 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://aspen-dp-central-m
2017-03-29 16:12:22,404 [main] INFO  com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase - GHFS version: 1.6.0-hadoop2
2017-03-29 16:12:22,890 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-e53a2851-efe5-4e74-bf33-89dfe0733386
2017-03-29 16:12:22,890 [main] WARN  org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
2017-03-29 16:12:23,247 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Failed to parse: Pig script failed to parse: 
<line 8, column 13> pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:199)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1819)
    at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1527)
    at org.apache.pig.PigServer.parseAndBuild(PigServer.java:460)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:485)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:471)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:742)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
    at org.apache.pig.Main.run(Main.java:532)
    at org.apache.pig.Main.main(Main.java:176)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: 
<line 8, column 13> pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1339)
    at org.apache.pig.parser.LogicalPlanBuilder.buildFuncSpec(LogicalPlanBuilder.java:1324)
    at org.apache.pig.parser.LogicalPlanGenerator.func_clause(LogicalPlanGenerator.java:5184)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3515)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
    ... 19 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:671)
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1336)
    ... 27 more
2017-03-29 16:12:23,251 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /tmp/cb3b0696-3f30-4db4-a6a7-bb716d2a8a89/pig_1490803941959.log
2017-03-29 16:12:23,269 [main] INFO  org.apache.pig.Main - Pig script completed in 1 second and 477 milliseconds (1477 ms)
Job output is complete

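Registering the custom jar in the Pig script solves the problem. So, basically:

  1. Added my jar file to Google Storage
  2. Registered the jar in the script
  3. Submitted the Pig job through the UI or with the following command line: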

gcloud dataproc jobs submit pig --cluster eduboom-central --file custom.pig --jars=gs://eduboom-dataproc/custom/eduboom.jar

custom.pig:

register eduboom.jar;
raw = LOAD 'hbase://eduboom_table'
   USING com.eduboom.pig.load.HBaseMultiScanLoader('2017-03-30T14:00Z_00', '2017-03-30T14:01Z_25', 'cf:*')
   AS (key:chararray, data);
DUMP raw;
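
For reference, the three steps could be scripted roughly as follows. This is only a sketch: the bucket, cluster, and file names are the ones used above, while the `run` helper and the `DRY_RUN` variable are hypothetical additions so the commands can be previewed before anything is uploaded or submitted. Depending on your gcloud version and configuration, you may also need to pass `--region`.

```shell
#!/usr/bin/env bash
# Sketch of the workflow above: upload the jar, then submit the Pig job.
# Names are taken from the post; run() and DRY_RUN are hypothetical helpers.
set -euo pipefail

JAR_LOCAL="eduboom.jar"
JAR_GCS="gs://eduboom-dataproc/custom/eduboom.jar"
CLUSTER="eduboom-central"
SCRIPT="custom.pig"

# Print the command when DRY_RUN is 1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: copy the jar to Google Cloud Storage.
run gsutil cp "$JAR_LOCAL" "$JAR_GCS"

# Step 2 lives in custom.pig itself: the line "register eduboom.jar;".

# Step 3: submit the Pig job and pass the jar along with it.
run gcloud dataproc jobs submit pig --cluster "$CLUSTER" --file "$SCRIPT" --jars="$JAR_GCS"
```

Running it as-is only prints the two commands; set DRY_RUN=0 to actually execute them.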