Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API
Here is what I am trying:
import pyarrow as pa
conf = {"hadoop.security.authentication": "kerberos"}
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
However, when I submit this job to the cluster using Dask-YARN, I get the following error:
File "test/run.py", line 3
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
I also tried setting host (to a name node) and port (=8020), but I ran into the same error. Since the error is not descriptive, I am not sure which setting needs to change. Any clues?
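For reference, the host/port variant mentioned above looks roughly like this (the name-node hostname here is a placeholder, not a value from the actual cluster):

import pyarrow as pa

conf = {"hadoop.security.authentication": "kerberos"}
# "namenode.example.com" is a hypothetical host; substitute your name node
fs = pa.hdfs.connect(host="namenode.example.com", port=8020,
                     kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)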
Normally the configuration and Kerberos ticket are loaded automatically, and you should be able to connect with

fs = pa.hdfs.connect()

alone. This does require that you have already run kinit (on a worker node, the credentials, but not the ticket, are automatically transferred to the worker environment, so nothing needs to be done there). I would recommend trying without arguments locally first, and then on a worker node.
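A minimal sketch of that recommendation, assuming a valid Kerberos principal is available for kinit (the principal below is hypothetical):

# in a shell, once: kinit myuser@EXAMPLE.COM  (hypothetical principal)
import pyarrow as pa

# With no arguments, the host, port, and Kerberos ticket are picked up
# from the Hadoop configuration and the default ticket cache.
fs = pa.hdfs.connect()
print(fs.ls("/"))  # quick sanity check that the connection works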