AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] when using Hive warehouse

Question

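We recently enabled Kerberos authentication on our Spark cluster, and we found that code submitted as a Spark job in cluster mode can no longer connect to Hive. Should we use Kerberos to authenticate to Hive, and if so, how? As detailed below, I think we have to specify a keytab and a principal, but I don't know exactly which ones.

This is the exception we get: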

Traceback (most recent call last):
  File "/mnt/resource/hadoop/yarn/local/usercache/sa-etl/appcache/application_1649255698304_0003/container_e01_1649255698304_0003_01_000001/__pyfiles__/utils.py", line 222, in use_db
    spark.sql("CREATE DATABASE IF NOT EXISTS `{db}`".format(db=db))
  File "/usr/hdp/current/spark3-client/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/hdp/current/spark3-client/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/hdp/current/spark3-client/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: java.lang.RuntimeException: java.io.IOException: DestHost:destPort hn1-pt-dev.MYREALM:8020 , LocalHost:localPort wn1-pt-dev/10.208.3.12:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

I also see this exception:

org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over hn0-pt-dev.myrealm/10.208.3.15:8020

This is the script that produces the exception. As you can see, it fails at CREATE DATABASE:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS TestDb")

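Environment and relevant information

We have an ESP-enabled HDInsight cluster in Azure, inside a virtual network. Logging in to the cluster through AADDS works fine. The cluster is connected to a storage account, talks to it over ABFS, and stores the Hive warehouse there. We are using YARN. We want to run Spark jobs with PySpark from Azure Data Factory, which uses Livy, but if we can get it to work with the spark-submit CLI, it will hopefully work with Livy too. We are using Spark 3.1.1 and Kerberos 1.10.3-30.

The exception only occurs with spark-submit --deploy-mode cluster. In client mode there is no exception and the database is created.

The exception also disappears when we remove .enableHiveSupport, so it is apparently related to authentication with Hive. We do need the Hive warehouse, though, because we need to access tables from multiple Spark sessions and therefore have to persist them.

We can access HDFS in cluster mode, since sc.textFile('/example/data/fruits.txt').collect() works fine.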

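Similar questions and possible solutions

In the exception I see that a worker node is trying to access the head node. The port is 8020, which I think is the NameNode port, so this does sound HDFS-related, except that as far as I understand we can access HDFS but not Hive.

https://spark.apache.org/docs/latest/running-on-yarn.html#kerberos suggests specifying the principal and keytab file explicitly, so I located the keytab file with klist -k and added --principal myusername@MYREALM --keytab /etc/krb5.keytab to the spark-submit command line. This is the same keytab file as in one of the linked questions below, but I got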

Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: for principal: myusername@MYREALM from keytab /etc/krb5.keytab javax.security.auth.login.LoginException: Unable to obtain password from user

Maybe my keytab file is wrong, because when I run klist -k /etc/krb5.keytab I only get slots with entries such as HN0-PT-DEV@MYREALM and host/hn0-pt-dev.myrealm@MYREALM. If I look at the keytabs for hdfs/hive in /etc/security/keytabs, I likewise only see entries for the hdfs/hive users.

When I try adding all the extraJavaOptions specified in How to use Apache Spark to query Hive table with Kerberos? without specifying principal/keytab, I get KrbException: Cannot locate default realm, even though the default realm in /etc/krb5.conf is correct.

In Ambari I can see the settings spark.yarn.keytab={{hive_kerberos_keytab}} and spark.yarn.principal={{hive_kerberos_principal}}.

Following https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-faq#how-do-i-create-a-keytab-for-an-hdinsight-esp-cluster- I created a keytab for my user and specified that file, but it did not help.

Many other answers/websites also seem to suggest specifying the principal/keytab explicitly:

Spark on YARN + Secured hbase (about HBase rather than Hive, but the same conclusion)
https://www.ibm.com/docs/en/spectrum-conductor/2.4.1?topic=ssbaig-submitting-spark-batch-applications-kerberos-enabled-hdfs-keytab
Issue with Spark Java API, Kerberos, and Hive
https://docs.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html#concept_bvc_pcy_dt (I cannot find a similar document from Microsoft)
spark-submit,Client cannot authenticate via:[TOKEN, KERBEROS];

Other questions:

https://spark.apache.org/docs/2.1.1/running-on-yarn.html#running-in-a-secure-cluster To start with the official documentation, it explains:

For a Spark application to interact with HDFS, HBase and Hive, it must acquire the relevant tokens using the Kerberos credentials of the user launching the application —that is, the principal whose identity will become that of the launched Spark application. This is normally done at launch time: in a secure cluster Spark will automatically obtain a token for the cluster’s HDFS filesystem, and potentially for HBase and Hive.

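Well, the user launching the application has a valid ticket, as the output of klist shows. The user has Contributor access to the blob storage (not sure whether that is actually needed). I do not understand, though, what is meant by "Spark will automatically obtain a token for Hive [at launch time]". I did restart all services on the cluster, but that did not help.

Kerberos authentication with Hadoop cluster from Spark stand alone cluster running on Kubernetes cluster (this is a situation with two clusters). As explained there: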

in yarn-cluster mode, the Spark client uses the local Kerberos ticket to connect to Hadoop services and retrieve special auth tokens that are then shipped to the YARN container running the driver; then the driver broadcasts the token to the executors

When running Spark on Kubernetes to access kerberized Hadoop cluster, how do you resolve a "SIMPLE authentication is not enabled" error on executors? (for an older Spark version)
Cannot connect to HIVE with Secured kerberos. I am using UserGroupInformation.loginUserFromKeytab() (about JAAS)
Spark-submit job fails on yarn nodemanager with error Client cannot authenticate via:[TOKEN, KERBEROS] (unanswered)
Does not make sense to me.
Hive is not accessible via Spark In Kerberos Environment : Client cannot authenticate via:[TOKEN, KERBEROS] (suggests adding spark.security.credentials.hadoopfs.enabled=true)
https://funclojure.tumblr.com/post/155129283948/hdfs-kerberos-java-client-api-pains (about JARs)
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] Issue (unanswered)
https://issues.apache.org/jira/browse/SPARK-27554 (unanswered)
Old.

Things I could still try:

https://spark.apache.org/docs/2.1.1/running-on-yarn.html#troubleshooting-kerberos Enable more verbose logging.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-linux-ambari-ssh-tunnel Looking at the NameNode UI may provide some information.

Update

When logged in as the hive user:

Running kinit and then providing the hive password:

Password for hive/hn0-pt-dev.myrealm@MYREALM: 
kinit: Password incorrect while getting initial credentials


hive@hn0-pt-dev:/tmp$ klist -k /etc/security/keytabs/hive.service.keytab
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
   0 hive/hn0-pt-dev.myrealm@MYREALM
hive@hn0-pt-dev:/tmp$ kinit -k /etc/security/keytabs/hive.service.keytab
kinit: Client '/etc/security/keytabs/hive.service.keytab@MYREALM' not found in Kerberos database while getting initial credentials

Answer 1

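In general, you have to either complete a successful kinit or pass a principal/keytab in order to use Kerberos with Spark/Hive. There are also some settings that complicate the use of Hive (impersonation).

In general, if you can kinit and use hdfs to write to your own folder, your keytab is working: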

kinit  # enter user info
hdfs dfs -touch /home/myuser/somefile  # guarantees you have a home directory; Spark needs this
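
The kinit call above prompts for a password. When logging in from a keytab instead, MIT Kerberos kinit takes the keytab file through -t and the principal as a separate argument. In the question's update, kinit -k /etc/security/keytabs/hive.service.keytab was rejected with "Client '/etc/security/keytabs/hive.service.keytab@MYREALM' not found", which suggests the path was read as the principal name. A minimal sketch that only builds the command lines; the helper names are made up for illustration, and the keytab path and principal are copied from the question's klist output:

```python
import shlex


def kinit_from_keytab(keytab, principal):
    """Build a kinit command that logs in from a keytab (hypothetical helper).

    -k asks kinit to use a keytab instead of a password,
    -t names the keytab file, and the principal comes last.
    """
    return ["kinit", "-k", "-t", keytab, principal]


def klist_keytab(keytab):
    """Build a klist command that lists the principals stored in a keytab."""
    return ["klist", "-k", keytab]


# Values taken from the question; replace them with your own keytab and principal.
keytab = "/etc/security/keytabs/hive.service.keytab"
principal = "hive/hn0-pt-dev.myrealm@MYREALM"

kinit_cmd = kinit_from_keytab(keytab, principal)
klist_cmd = klist_keytab(keytab)

# Print the commands so they can be reviewed before running them by hand
# or through subprocess.run(kinit_cmd, check=True).
print(shlex.join(klist_cmd))
print(shlex.join(kinit_cmd))
```

The principal passed to kinit has to be one of the entries that klist -k reports for that keytab.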

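Once you know that works, you should check whether you can write to Hive:

Either use a JDBC connection, or use Beeline with a connection string like the following: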

jdbc:hive2://HiveHost:10001/default;principal=myuser@HOST1.COM;

This helps narrow down where the problem is.
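
A small sketch of how that connection string could be assembled and handed to Beeline. The function names are hypothetical, the host, port, and principal are the placeholder values from the string above, and which principal your HiveServer2 expects in the URL is something to verify for your own cluster:

```python
import shlex


def hive_jdbc_url(host, port, database, principal):
    """Assemble a HiveServer2 JDBC URL in the same shape as the example above."""
    return f"jdbc:hive2://{host}:{port}/{database};principal={principal};"


def beeline_command(url):
    """Build a Beeline invocation that connects with the given JDBC URL."""
    return ["beeline", "-u", url]


# Placeholder values copied from the answer's example string.
url = hive_jdbc_url("HiveHost", 10001, "default", "myuser@HOST1.COM")
cmd = beeline_command(url)

# shlex.join quotes the URL, which matters because it contains semicolons.
print(shlex.join(cmd))
```

Running the printed command after a successful kinit shows whether the Hive side accepts the ticket independently of Spark.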

If you are looking at a Hive problem, you need to check impersonation:

HiveServer2 Impersonation Important: This is not the recommended method to implement HiveServer2 authorization. Cloudera recommends you use Sentry to implement this instead. HiveServer2 impersonation lets users execute queries and access HDFS files as the connected user rather than as the super user. Access policies are applied at the file level using the HDFS permissions specified in ACLs (access control lists). Enabling HiveServer2 impersonation bypasses Sentry from the end-to-end authorization process. Specifically, although Sentry enforces access control policies on tables and views within the Hive warehouse, it does not control access to the HDFS files that underlie the tables. This means that users without Sentry permissions to tables in the warehouse may nonetheless be able to bypass Sentry authorization checks and execute jobs and queries against tables in the warehouse as long as they have permissions on the HDFS files supporting the table.

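If you are on Windows, you should pay attention to the ticket cache. Consider setting up your own personal ticket cache location, because Windows usually uses one common location for all users. (This lets users log in on top of each other, which produces strange errors.)

If you have a Hive problem, the Hive logs themselves can often help you understand why the process is not working. (But you will only have a log if some Kerberos step succeeded; if it failed completely, you will see nothing.)

Check Ranger to see whether it reports any errors.

If you want to access the Hive warehouse in cluster mode, you need to pass the keytab and principal to spark-submit (this is stated explicitly in the official docs):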

Using a Keytab By providing Spark with a principal and keytab (e.g. using spark-submit with --principal and --keytab parameters), the application will maintain a valid Kerberos login that can be used to retrieve delegation tokens indefinitely.

Note that when using a keytab in cluster mode, it will be copied over to the machine running the Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it’s strongly recommended that both YARN and HDFS be secured with encryption, at least.
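
To make the quoted passage concrete, here is a minimal sketch that assembles a spark-submit command line with --principal and --keytab for YARN cluster mode. The helper name, script name, and keytab path are placeholders; the principal is the one used in the question:

```python
import shlex


def build_spark_submit(app, principal, keytab, deploy_mode="cluster"):
    """Build a spark-submit argument list that logs in from a keytab (hypothetical helper)."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--principal", principal,  # Kerberos principal the application runs as
        "--keytab", keytab,        # keytab containing that principal's keys
        app,
    ]


# Placeholder script and keytab path; substitute your own.
submit_cmd = build_spark_submit(
    app="my_job.py",
    principal="myusername@MYREALM",
    keytab="/path/to/myusername.keytab",
)

# Print for review; subprocess.run(submit_cmd, check=True) would execute it.
print(shlex.join(submit_cmd))
```

The keytab passed here has to contain an entry for exactly the principal given to --principal, which klist -k on the keytab file can confirm.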

You need to create your own keytab.
After creating the keytab, make sure the right user has permissions on it, otherwise you will again get Unable to obtain password from user.
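
A quick way to check the permission point before submitting is sketched below. The function name is made up, and the group/other check is general keytab hygiene that I am adding as an assumption, not something the error message itself requires:

```python
import os
import stat


def keytab_problems(path):
    """Return a list of human-readable problems found with a keytab file."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a regular file"]

    problems = []
    # The user running spark-submit must be able to read the keytab.
    if not os.access(path, os.R_OK):
        problems.append(f"{path} is not readable by the current user")

    # Assumption: a keytab is as sensitive as a password, so flag any
    # access granted to group or others (chmod 600 would clear this).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} is accessible to group/others (mode {mode:o})")
    return problems


if __name__ == "__main__":
    # Placeholder path; point this at the keytab you pass to --keytab.
    for problem in keytab_problems("/path/to/myusername.keytab"):
        print(problem)
```

Run it as the same user that runs spark-submit, since readability is checked for the current user.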

If you are using Livy, --proxy-user will conflict with --principal, but that is easy to work around (use livy.impersonation.enabled=false).
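
As a rough illustration of the Livy case, the sketch below builds a JSON body for Livy's POST /batches endpoint that passes the principal and keytab as Spark configuration and leaves out proxyUser. The function name, file path, and keytab location are placeholders. spark.kerberos.principal and spark.kerberos.keytab are the Spark 3 names for the spark.yarn.principal and spark.yarn.keytab settings seen in Ambari in the question. Whether your Livy server accepts this combination depends on its own configuration, so treat it as an assumption to verify:

```python
import json


def livy_batch_payload(app_file, principal, keytab):
    """Build a request body for Livy's POST /batches (hypothetical helper).

    proxyUser is deliberately omitted, because the answer notes that
    --proxy-user conflicts with --principal.
    """
    return {
        "file": app_file,
        "conf": {
            "spark.kerberos.principal": principal,
            "spark.kerberos.keytab": keytab,
        },
    }


# Placeholder values; substitute your own script, principal, and keytab.
payload = livy_batch_payload(
    app_file="/path/to/my_job.py",
    principal="myusername@MYREALM",
    keytab="/path/to/myusername.keytab",
)

# This only prints the body; sending it to the Livy server is left out.
print(json.dumps(payload, indent=2))
```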


Tags: hive, kerberos, apache-spark, azure-hdinsight
