Dataproc PySpark Workers Have No Permission to Use gsutil
Under Dataproc, I have set up a PySpark cluster with 1 master node and 2 worker nodes. In a bucket I have a directory of sub-directories of files.
In a Datalab notebook I run
import subprocess
all_parent_direcotry = subprocess.Popen("gsutil ls gs://parent-directories ",shell=True,stdout=subprocess.PIPE).stdout.read()
and it gives me all the sub-directories with no problem.
Then I want to gsutil ls all the files under a sub-directory, so on the master node I have:
def get_sub_dir(path):
    import subprocess
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()
Running get_sub_dir(sub-directory) there returns all the files with no problem.
However,
sub_dir = sc.parallelize([sub-directory])
sub_dir.map(get_sub_dir).collect()
gives me:
Traceback (most recent call last):
File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module>
main()
File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main
project, account = bootstrapping.GetActiveProjectAndAccount()
File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount
project_name = properties.VALUES.core.project.Get(validate=False)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get
required)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty
value = _GetPropertyWithoutDefault(prop, properties_file)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault
value = callback()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject
return c_gce.Metadata().Project()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata
_metadata_lock.lock(function=_CreateMetadata, argument=None)
File "/usr/lib/python2.7/mutex.py", line 44, in lock
function(argument)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata
_metadata = _GCEMetadata()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__
self.connected = gce_cache.GetOnGCE()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE
return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE
self._WriteDisk(on_gce)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk
with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file:
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate
MakeDir(full_parent_dir_path, mode=0700)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir
(u'Please verify that you have permissions to write to the parent '
googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied.
Please verify that you have permissions to write to the parent directory.
After checking, whoami on a worker node shows yarn.
So the question is: how do I authorize yarn to use gsutil, or is there any other way to access the bucket from a Dataproc PySpark worker node?
When the CLI fetches a token from the metadata service, it looks at the current home directory to find a place to put a cached credentials file. The relevant code in googlecloudsdk/core/config.py looks something like this:
def _GetGlobalConfigDir():
  """Returns the path to the user's global config area.

  Returns:
    str: The path to the user's global config area.
  """
  # Name of the directory that roots a cloud SDK workspace.
  global_config_dir = encoding.GetEncodedValue(os.environ, CLOUDSDK_CONFIG)
  if global_config_dir:
    return global_config_dir
  if platforms.OperatingSystem.Current() != platforms.OperatingSystem.WINDOWS:
    return os.path.join(os.path.expanduser('~'), '.config',
                        _CLOUDSDK_GLOBAL_CONFIG_DIR_NAME)
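The traceback above ends with "Could not create directory [/home/.config/gcloud]", which is exactly what this lookup produces when CLOUDSDK_CONFIG is unset and HOME resolves to /home/. A small self-contained mirror of that fallback logic (my own sketch, not SDK code) makes the failure mode easy to see:

def guess_config_dir(env):
    # Mirrors _GetGlobalConfigDir: an explicit CLOUDSDK_CONFIG wins,
    # otherwise fall back to $HOME/.config/gcloud.
    import os
    if env.get('CLOUDSDK_CONFIG'):
        return env['CLOUDSDK_CONFIG']
    return os.path.join(env.get('HOME', '/home/'), '.config', 'gcloud')

print(guess_config_dir({'HOME': '/home/'}))                # /home/.config/gcloud -> yarn cannot write here
print(guess_config_dir({'HOME': '/var/lib/hadoop-yarn'}))  # /var/lib/hadoop-yarn/.config/gcloud -> writable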
For something running in a YARN container, even though it runs as user yarn, and even though if you just run sudo su yarn you'll see ~ resolve to /var/lib/hadoop-yarn on a Dataproc node, YARN actually propagates yarn.nodemanager.user-home-dir as the container's homedir, and this defaults to /home/. For this reason, even though you can sudo -u yarn gsutil ..., it doesn't behave the same way as gsutil does inside a YARN container, and naturally, only root is able to create directories in the base /home/ directory.
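A quick way to see this from the notebook is to ask the executors what their containers think HOME is (a diagnostic sketch added here, assuming the same SparkContext sc as above):

def report_container_env(_):
    # Runs inside a YARN container on a worker: report the user and HOME
    import os, getpass
    return getpass.getuser(), os.environ.get('HOME')

print(sc.parallelize([0]).map(report_container_env).collect())
# on a default Dataproc cluster this should come back as something like [('yarn', '/home/')]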
Long story short, you have two options:
- In your code, add HOME=/var/lib/hadoop-yarn right before the gsutil statement (an equivalent variant using subprocess's env argument is sketched after this list).
Example:
p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path, shell=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
- When creating the cluster, specify the YARN property.
Example:
gcloud dataproc clusters create --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn ...
For existing clusters, you could also manually add the config to /etc/hadoop/conf/yarn-site.xml on all your workers and then restart the worker machines (or just run sudo systemctl restart hadoop-yarn-nodemanager.service), but that can be a pain to do by hand on every worker node; a sketch of that property entry follows below.
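As an aside that is not part of the original answer: if you prefer not to splice HOME= into the shell string for option 1, the same override can be passed through subprocess's env argument:

def get_sub_dir(path):
    import os, subprocess
    # Copy the worker's environment but point HOME at the yarn user's real homedir,
    # so the Cloud SDK writes its credential cache under /var/lib/hadoop-yarn/.config/gcloud.
    env = dict(os.environ, HOME="/var/lib/hadoop-yarn")
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path,
                         shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         env=env)
    return p.stdout.read(), p.stderr.read()

And for the manual yarn-site.xml route on an existing cluster, the entry should be the standard Hadoop property block (a sketch matching the property named above):

<property>
  <name>yarn.nodemanager.user-home-dir</name>
  <value>/var/lib/hadoop-yarn</value>
</property>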