Accessing HDFS configured for high availability from a client program

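I am trying to understand why one of the two programs below works and the other does not. Both connect to HDFS from outside the cluster through the nameservice, which should route to the active NameNode in a high-availability setup.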

Non-working program:

When I load the two configuration files (core-site.xml and hdfs-site.xml) and access a file on HDFS, it throws an error:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HadoopAccess {

      def main(args: Array[String]): Unit = {
        // Start from an empty configuration (no default resources loaded)
        val hadoopConf = new Configuration(false)
        // Forward slashes avoid invalid escape sequences such as \U in a Scala string
        val coreSiteXML = "C:/Users/507/conf/core-site.xml"
        val HDFSSiteXML = "C:/Users/507/conf/hdfs-site.xml"
        hadoopConf.addResource(new Path("file:///" + coreSiteXML))
        hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
        println("hadoopConf : " + hadoopConf.get("fs.defaultFS"))

        val fs = FileSystem.get(hadoopConf)
        val check = fs.exists(new Path("/apps/hive"))
        // println("Checked : " + check)
      }

    }

Error: we see an UnknownHostException

hadoopConf : hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
    at HadoopAccess$.main(HadoopAccess.scala:28)
    at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster

Working program: I explicitly set the high-availability properties on the hadoopConf object and pass it to the FileSystem object, and the program works.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HadoopAccess {

      def main(args: Array[String]): Unit = {
        val hadoopConf = new Configuration(false)
        // Forward slashes avoid invalid escape sequences such as \U in a Scala string
        val coreSiteXML = "C:/Users/507/conf/core-site.xml"
        val HDFSSiteXML = "C:/Users/507/conf/hdfs-site.xml"
        hadoopConf.addResource(new Path("file:///" + coreSiteXML))
        hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))

        // Re-set values that were read from the XML files
        hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
        // hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
        // hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
        hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))

        // HA settings for the nameservice "mycluster", set explicitly
        hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
        hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
        hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
        hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
        println(hadoopConf)
        /* val namenode = hadoopConf.get("fs.defaultFS")

        println("namenode: " + namenode) */

        val fs = FileSystem.get(hadoopConf)
        val check = fs.exists(new Path("hdfs://mycluster/apps/hive"))
        // println("Checked : " + check)
      }

    }

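Why do we need to set values for properties such as dfs.nameservices, dfs.client.failover.proxy.provider.mycluster and dfs.namenode.rpc-address.mycluster.nn1 on the hadoopConf object, when these values are already present in hdfs-site.xml and core-site.xml? These properties are the high-availability NameNode settings.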

I run the programs above either on an edge node or locally from IntelliJ.

Hadoop version: 2.7.3.2, Hortonworks: 2.6.1

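My observations from the Spark Scala REPL:

When I run val hadoopConf = new Configuration(false) followed by val fs = FileSystem.get(hadoopConf), I get the local file system. Then, when I run the following

    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))

the file system changes to DFSFileSystem. My assumption is that some client library that is present in Spark is not available during the build, or in some common location on the edge node.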

some client library which is in Spark that is not available in somewhere in during build or edge node common place

That common location is $SPARK_HOME/conf and/or $HADOOP_CONF_DIR. However, if you are just running a regular Scala application with java -jar or from IntelliJ, Spark has nothing to do with it.

... this values already present in hdfs-site.xml file and core-site.xml

In that case they should be read accordingly, but setting them again in code does no harm either.
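One way to confirm that the XML files really were read is to check that every HA-related key is present after loading. The following is a minimal sketch in plain Scala with no Hadoop dependency: it works on a Map of property names to values, which you would have to fill yourself from the loaded Configuration (for example by calling get for each key). The names HaConfigCheck and missingHaKeys are made up for illustration; the property names are the ones used in the question.

```scala
object HaConfigCheck {

  /** Returns the HA client keys that are absent or blank for `nameservice`.
    * `props` is a plain snapshot of the loaded configuration (hypothetical input).
    */
  def missingHaKeys(props: Map[String, String], nameservice: String): Seq[String] = {
    def present(key: String): Boolean = props.get(key).exists(_.trim.nonEmpty)

    // NameNode IDs declared for this nameservice, e.g. "nn1,nn2" -> Seq("nn1", "nn2")
    val nnIds: Seq[String] =
      props.getOrElse(s"dfs.ha.namenodes.$nameservice", "")
        .split(",").map(_.trim).filter(_.nonEmpty).toSeq

    val fixedKeys = Seq(
      "dfs.nameservices",
      s"dfs.ha.namenodes.$nameservice",
      s"dfs.client.failover.proxy.provider.$nameservice"
    )
    // One RPC address is expected per declared NameNode ID
    val rpcKeys = nnIds.map(id => s"dfs.namenode.rpc-address.$nameservice.$id")

    // Keep only the keys that are not set
    (fixedKeys ++ rpcKeys).filterNot(present)
  }
}
```

Calling HaConfigCheck.missingHaKeys(loadedProps, "mycluster") right after the two addResource calls and printing the result would show whether the files were actually picked up; an empty result means all of the keys were found.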

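These values are required because they tell the client where the actual NameNodes are running; without them, the client assumes mycluster is a real DNS name for a single server, which it is not.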
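The createNonHAProxy frame in the stack trace fits this explanation. As a rough model (a simplification written for illustration, not Hadoop's actual code), the client looks for a failover proxy provider configured for the host part of the file system URI; if it finds none, it treats that host as an ordinary hostname and tries to resolve it. The names HaResolutionModel, Target, HaNameservice, SingleHost and resolveTarget below are all hypothetical.

```scala
import java.net.URI

object HaResolutionModel {
  sealed trait Target
  /** Logical nameservice: the client chooses among the listed NameNode addresses. */
  final case class HaNameservice(name: String, rpcAddresses: Seq[String]) extends Target
  /** Plain host: the client would attempt a DNS lookup on this name. */
  final case class SingleHost(host: String) extends Target

  /** Simplified decision: HA only if a failover proxy provider is set for the URI host. */
  def resolveTarget(props: Map[String, String], defaultFs: String): Target = {
    val host = new URI(defaultFs).getHost
    if (props.contains(s"dfs.client.failover.proxy.provider.$host")) {
      // Collect the RPC address of each NameNode ID declared for the nameservice
      val ids = props.getOrElse(s"dfs.ha.namenodes.$host", "")
        .split(",").map(_.trim).filter(_.nonEmpty).toSeq
      HaNameservice(host, ids.flatMap(id => props.get(s"dfs.namenode.rpc-address.$host.$id")))
    } else {
      // No provider configured: the name is taken literally as a hostname
      SingleHost(host)
    }
  }
}
```

In this model, an empty property map with hdfs://mycluster yields SingleHost("mycluster"), which corresponds to the UnknownHostException in the question, while a map holding the HA keys from the working program yields the two NameNode addresses.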