hadoop streaming，使用-libjars 包含jar 文件

Question

我正在学习 hadoop，并编写了 map/reduce 个步骤来处理我拥有的一些 avro 文件。我认为我遇到的问题可能是由于我的 hadoop 安装所致。我正在尝试在我的笔记本电脑上以独立模式进行测试，而不是在分布式集群上。

这是我对运行工作的 bash 调用：

#!/bin/bash

reducer=/home/hduser/python-hadoop/test/reducer.py  
mapper=/home/hduser/python-hadoop/test/mapper.py
avrohdjar=/home/hduser/python-hadoop/test/avro-mapred-1.7.4-hadoop1.jar
avrojar=/home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar


hadoop jar ~/hadoop/share/hadoop/tools/lib/hadoop-streaming* \
  -D mapreduce.job.name="hd1" \
  -libjars ${avrojar},${avrohdjar} \ 
  -files   ${avrojar},${avrohdjar},${mapper},${reducer} \
  -input   ~/tmp/data/* \
  -output  ~/tmp/data-output \
  -mapper  ${mapper} \
  -reducer ${reducer} \
  -inputformat org.apache.avro.mapred.AvroAsTextInputFormat

这是输出：

15/04/23 11:02:54 INFO Configuration.deprecation: session.id is
deprecated. Instead, use dfs.metrics.session-id
15/04/23 11:02:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/04/23 11:02:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/04/23 11:02:54 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/home/hduser/tmp/mapred/staging/hduser1337717111/.staging/job_local1337717111_0001
15/04/23 11:02:54 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: hdfs://localhost:54310/home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar
Streaming Command Failed!

我尝试了很多不同的修复方法，但不知道下一步该尝试什么。由于某些原因，hadoop 无法找到 -libjars 指定的 jar 文件。此外，我已经成功地运行发布了 here 的 wordcount 示例，所以我的 hadoop 安装或配置工作得很好。谢谢！

编辑下面是我的hdfs的内容变化-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
   The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

这里是核心-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Answer 1

您的集群运行处于分布式模式。它正在尝试在以下路径中查找输入，但该路径不存在。

hdfs://localhost:54310/home/hduser/hadoop/share/hadoop/tools/lib/avro-1.7.4.jar

hadoop streaming，使用-libjars 包含jar 文件

hadoop streaming, using -libjars to include jar files

python

java

hadoop

hadoop-streaming