java.io.IOException: Login failure for myuser@example.com from keytab

I wrote a program that uses Spark Streaming to insert data into a Kerberos-enabled HBase cluster. In one batch, a task failed with the following error:

java.io.IOException: Login failure for myuser@example.com from keytab ./user.keytab
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1160)
    at com.framework.common.HbaseUtil$.InsertToHbase(HbaseUtil.scala:81)
    at com.framework.realtime.RDDUtil$$anonfun$dwsTodwd.apply(RDDUtil.scala:203)
    at com.framework.realtime.RDDUtil$$anonfun$dwsTodwd.apply(RDDUtil.scala:202)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$$anonfun$apply.apply(RDD.scala:920)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$$anonfun$apply.apply(RDD.scala:920)
    at org.apache.spark.SparkContext$$anonfun$runJob.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.security.auth.login.LoginException: Receive timed out
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:767)
    at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:584)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:762)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
    at javax.security.auth.login.LoginContext.run(LoginContext.java:690)
    at javax.security.auth.login.LoginContext.run(LoginContext.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:687)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:595)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1149)
    ... 13 more
Caused by: java.net.SocketTimeoutException: Receive timed out
    at java.net.PlainDatagramSocketImpl.receive0(Native Method)
    at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:146)
    at java.net.DatagramSocket.receive(DatagramSocket.java:816)
    at sun.security.krb5.internal.UDPClient.receive(NetClient.java:207)
    at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:390)
    at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:343)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.krb5.KdcComm.send(KdcComm.java:327)
    at sun.security.krb5.KdcComm.send(KdcComm.java:219)
    at sun.security.krb5.KdcComm.send(KdcComm.java:191)
    at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:319)
    at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:364)
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:735)
    ... 25 more

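On the second attempt, however, the task succeeded. It looks to me as if the authentication took too long on the first attempt and timed out, while on the retry it completed quickly enough to succeed. Am I right? If so, how can I fix this? My code is as follows: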

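import java.security.PrivilegedAction
import org.apache.hadoop.hbase.client.{HConnection, HConnectionManager, HTableInterface}
import org.apache.hadoop.security.UserGroupInformation

// princ, keytab, conf, tableName and partitionOfRecords come from the surrounding code
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(princ, keytab)

ugi.doAs(new PrivilegedAction[Unit]() {
  def run(): Unit = {
    var conn: HConnection = null
    var htable: HTableInterface = null
    try {
      conn = HConnectionManager.createConnection(conf)
      htable = conn.getTable(tableName)
      // Buffer puts on the client instead of sending one RPC per record
      htable.setAutoFlushTo(false)
      for (record <- partitionOfRecords) {
        htable.put(record)
      }
      // Send the buffered puts to the region servers
      htable.flushCommits()
    } finally {
      // Release the table and the connection even if a put fails
      if (htable != null) htable.close()
      if (conn != null) conn.close()
    }
  }
})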

From Hadoop and Kerberos - the Madness beyond the Gate, chapter "Error Messages to Fear"...

Receive timed out

Usually in a stack trace like

Caused by: java.net.SocketTimeoutException: Receive timed out
at java.net.PlainDatagramSocketImpl.receive0(Native Method)
...
at sun.security.krb5.internal.UDPClient.receive(NetClient.java:207)

... UDP socket ... Switch to TCP —at the very least, it will fail faster.

And just above that:

Switching kerberos to use TCP rather than UDP
In /etc/krb5.conf:

[libdefaults]
udp_preference_limit = 1
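
Because the login runs on the Spark executors, the setting has to be present in the krb5.conf that each executor JVM reads, not only on the machine that submits the job. As a rough way to check a given node, the sketch below scans the lines of a krb5.conf for udp_preference_limit inside [libdefaults]. The helper name udpPreferenceLimit is hypothetical, and the parsing is deliberately simplified (it does not follow include or includedir directives).

```scala
import scala.util.Try

// Hypothetical helper: returns the udp_preference_limit value found in the
// [libdefaults] section of a krb5.conf, or None if it is not set there.
// Simplification: include/includedir directives are not followed.
def udpPreferenceLimit(lines: Seq[String]): Option[Int] = {
  var section = ""
  var found: Option[Int] = None
  for (raw <- lines) {
    val line = raw.trim
    val isComment = line.startsWith("#") || line.startsWith(";")
    if (line.startsWith("[") && line.endsWith("]")) {
      // Section header such as [libdefaults] or [realms]
      section = line.substring(1, line.length - 1).trim
    } else if (section == "libdefaults" && !isComment && line.contains("=")) {
      val parts = line.split("=", 2).map(_.trim)
      if (parts(0) == "udp_preference_limit") {
        // Keep the previous value if this one is not a valid integer
        found = Try(parts(1).toInt).toOption.orElse(found)
      }
    }
  }
  found
}

// Usage sketch on a node (the path is the one named above):
// val src = scala.io.Source.fromFile("/etc/krb5.conf")
// try println(udpPreferenceLimit(src.getLines().toList)) finally src.close()
```

A result of Some(1) means the setting from the fragment above is in place on that node; None means it is missing from [libdefaults] in that file.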


In general, many flaky Kerberos issues seem to occur only over UDP, so it is unfortunate that UDP is used by default...


Note that Java also supports a kdc_timeout configuration parameter, but it is a mess:

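  • not mentioned in the MIT Kerberos documentation
  • not mentioned in the Unix/Linux documentation, except for BSD
  • mentioned only in the darkest corners of Java documentation, here for Java 9, with a funny side note that the default changed at some point from 30s-expressed-implicitly-in-milliseconds to 30s
  • a few weeks ago, Cloudera Support published a recommendation about this setting (because the default 30-second timeout may cause cascading failures in HDFS High Availability or something like that), but the poor guys had no idea what they were recommending, so they randomly advised "3" or "3s" or "3000" as the explicit timeout value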
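
For illustration only, this is roughly where such a setting would go, next to the TCP switch shown earlier. The value and its unit are assumptions: as far as I can tell, recent Java versions read a bare number as milliseconds and a number with an "s" suffix as seconds, which would make "3" mean 3 milliseconds. Check the documentation for the Java version you actually run before relying on this.

```
[libdefaults]
udp_preference_limit = 1
# Assumption: a bare number is read as milliseconds by the Java Kerberos
# client, so this asks for a 3-second timeout per KDC request.
kdc_timeout = 3000
```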


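Also note that if you have multiple KDCs for high availability, and these KDCs are listed explicitly in krb5.conf (or are reached through, e.g., a DNS alias set up with round-robin rules), then on a "KDC timeout" Java should retry with the next KDC, unless you have already hit the global timeout.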
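
Since the failure in the question was transient (the second task attempt succeeded), one further option is to retry the login inside the application, so that an occasional KDC timeout does not fail the whole Spark task. The sketch below is a minimal, generic retry helper using only the Scala standard library. The name retry and its parameters are hypothetical and are not part of Hadoop or Spark.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not a Hadoop or Spark API: evaluates `op` up to
// `attempts` times, sleeping `delayMs` milliseconds after each failed attempt,
// and rethrows the last exception if every attempt fails.
def retry[T](attempts: Int, delayMs: Long)(op: => T): T = {
  require(attempts >= 1, "attempts must be at least 1")
  Try(op) match {
    case Success(value) => value
    case Failure(_) if attempts > 1 =>
      Thread.sleep(delayMs)
      retry(attempts - 1, delayMs)(op)
    case Failure(e) => throw e
  }
}

// Usage sketch with the names from the question (needs Hadoop on the classpath):
// val ugi = retry(3, 1000L) {
//   UserGroupInformation.loginUserFromKeytabAndReturnUGI(princ, keytab)
// }
```

This does not remove the underlying UDP timeout. It only keeps a single slow KDC response from surfacing as a failed task, so switching to TCP as described above is still the first thing to try.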