Why can't I create a Google DataProc cluster with both Jupyter and DataLab installed?
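I want to create a DataProc cluster with both Jupyter and DataLab installed (I know they are very similar, but team members have different preferences). I can create a cluster with either one of them:
Cluster with Jupyter: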
gcloud dataproc clusters create $DATAPROC_CLUSTER_NAME_JUPYTER \
--project $PROJECT \
--bucket $BUCKET \
--zone $ZONE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--metadata JUPYTER_PORT=$JUPYTER_PORT,JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn
Cluster with DataLab:
gcloud dataproc clusters create $DATAPROC_CLUSTER_NAME_DATALAB \
--project $PROJECT \
--bucket $BUCKET \
--zone $ZONE \
--master-boot-disk-size $MASTER_DISK_SIZE \
--worker-boot-disk-size $WORKER_DISK_SIZE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--scopes cloud-platform,bigquery
Both work fine. However, when I try to create a single cluster with both of them, it fails:
gcloud dataproc clusters create test \
--project $PROJECT \
--bucket $BUCKET \
--zone $ZONE \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh,gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--metadata JUPYTER_PORT=$JUPYTER_PORT,JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn \
--scopes cloud-platform,bigquery
The error message is:
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/abc/regions/global/operations/d34943dc-5bda-386f-af91-db6e0516e2c5] failed: Multiple Errors:
- Initialization action failed. Failed action 'gs://dataproc-initialization-actions/jupyter/jupyter.sh', see output in: gs://abc/google-cloud-dataproc-metainfo/266175ef-e595-4732-b351-335837a3f30e/test-m/dataproc-initialization-script-2_output
- Initialization action failed. Failed action 'gs://dataproc-initialization-actions/jupyter/jupyter.sh', see output in: gs://abc/google-cloud-dataproc-metainfo/266175ef-e595-4732-b351-335837a3f30e/test-w-0/dataproc-initialization-script-2_output
- Initialization action failed. Failed action 'gs://dataproc-initialization-actions/jupyter/jupyter.sh', see output in: gs://abc/google-cloud-dataproc-metainfo/266175ef-e595-4732-b351-335837a3f30e/test-w-1/dataproc-initialization-script-2_output.
The output file for test-m looks like this:
++ /usr/share/google/get_metadata_value attributes/dataproc-role
+ readonly ROLE=Worker
+ ROLE=Worker
++ /usr/share/google/get_metadata_value attributes/INIT_ACTIONS_REPO
++ echo https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
+ readonly INIT_ACTIONS_REPO=https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
+ INIT_ACTIONS_REPO=https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
++ /usr/share/google/get_metadata_value attributes/INIT_ACTIONS_BRANCH
++ echo master
+ readonly INIT_ACTIONS_BRANCH=master
+ INIT_ACTIONS_BRANCH=master
++ /usr/share/google/get_metadata_value attributes/JUPYTER_CONDA_CHANNELS
+ readonly JUPYTER_CONDA_CHANNELS=
+ JUPYTER_CONDA_CHANNELS=
++ /usr/share/google/get_metadata_value attributes/JUPYTER_CONDA_PACKAGES
+ readonly JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn
+ JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn
+ echo 'Cloning fresh dataproc-initialization-actions from repo https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git and branch master...'
Cloning fresh dataproc-initialization-actions from repo https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git and branch master...
+ git clone -b master --single-branch https://github.com/GoogleCloudPlatform/dataproc-initialization-actions.git
fatal: destination path 'dataproc-initialization-actions' already exists and is not an empty directory.
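It looks like the git clone step is preventing the installation from succeeding. How can I fix this? Any suggestions are appreciated, thanks.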
This looks like a bug in the init actions: the repository cannot be git cloned twice into the same directory. We will fix this.
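For illustration, the failure in the log comes from running `git clone` into a destination directory that already exists. A guard along the following lines would avoid it. This is only a sketch of the general idea, not the actual change made to the init actions; `clone_if_missing` is a hypothetical helper name.

```shell
# Hypothetical helper, not part of the published init actions.
# Clones a repository only when the destination is not already a checkout,
# so a second init action that needs the same repo does not fail.
clone_if_missing() {
  local repo="$1" branch="$2" dest="$3"
  if [ -d "${dest}/.git" ]; then
    echo "Reusing existing checkout in ${dest}"
    return 0
  fi
  git clone -b "${branch}" --single-branch "${repo}" "${dest}"
}
```

With the values shown in the log, the call would be `clone_if_missing "$INIT_ACTIONS_REPO" "$INIT_ACTIONS_BRANCH" dataproc-initialization-actions`. Note that the guard reuses whatever checkout is already present, so it assumes both init actions want the same repo and branch.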
In the meantime, you can try using the Jupyter optional component together with the Datalab init action.
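A rough sketch of what that workaround might look like, reusing the variables from the question. Several things here are assumptions: the wrapper name `create_notebook_cluster` is made up, the component list `ANACONDA,JUPYTER` applies to image versions where Jupyter depends on Anaconda (component names and availability vary by image version and gcloud release, and older releases may need `gcloud beta`), and the `JUPYTER_PORT` / `JUPYTER_CONDA_PACKAGES` metadata are left out because they are read by `jupyter.sh`, which is no longer in the list.

```shell
# Sketch only: create_notebook_cluster is a hypothetical wrapper name.
# Assumes PROJECT, BUCKET, ZONE, GCS_CONNECTOR_VERSION and BQ_CONNECTOR_VERSION
# are set as in the question.
create_notebook_cluster() {
  local cluster_name="$1"
  # Jupyter comes from the optional component, so jupyter.sh is dropped from
  # --initialization-actions; only the connectors and Datalab actions remain.
  gcloud dataproc clusters create "${cluster_name}" \
    --project "${PROJECT}" \
    --bucket "${BUCKET}" \
    --zone "${ZONE}" \
    --optional-components ANACONDA,JUPYTER \
    --initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
    --metadata "gcs-connector-version=${GCS_CONNECTOR_VERSION}" \
    --metadata "bigquery-connector-version=${BQ_CONNECTOR_VERSION}" \
    --scopes cloud-platform,bigquery
}
```

Calling `create_notebook_cluster test` would then issue the same create command as in the question, with only one init action left that clones the repository.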