Dataproc Cluster creation is failing with PIP error "Could not build wheels"

We spin up the cluster with the configuration below. It ran fine until last week, but now it fails with this error:

ERROR: Failed cleaning build dir for libcst
Failed to build libcst
ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly

Building wheels for collected packages: pynacl, libcst
  Building wheel for pynacl (PEP 517): started
  Building wheel for pynacl (PEP 517): still running...
  Building wheel for pynacl (PEP 517): finished with status 'done'
  Created wheel for pynacl: filename=PyNaCl-1.5.0-cp37-cp37m-linux_x86_64.whl size=201317 sha256=4e5897bc415a327f6b389b864940a8c1dde9448017a2ce4991517b30996acb71
  Stored in directory: /root/.cache/pip/wheels/2f/01/7f/11d382bf954a093a55ed9581fd66c3b45b98769f292367b4d3
  Building wheel for libcst (PEP 517): started
  Building wheel for libcst (PEP 517): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/anaconda/bin/python /opt/conda/anaconda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpon3bonqi
       cwd: /tmp/pip-install-9ozf4fcp/libcst

Cluster configuration command:

gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://goog-dataproc-initialization-actions-us-east1/python/pip-install.sh' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>

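Things I tried:

a) I explicitly added the wheel package to the list of pip packages, but that did not resolve the issue.

b) The gcloud command with an upgrade-pip script added to the initialization actions: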

gcloud dataproc clusters create cluster-test \
--enable-component-gateway \
--region us-east1 \
--zone us-east1-b \
--master-machine-type n1-highmem-32 \
--master-boot-disk-size 500 \
--num-workers 3 \
--worker-machine-type n1-highmem-16 \
--worker-boot-disk-size 500 \
--optional-components ANACONDA,JUPYTER,ZEPPELIN \
--image-version 1.5.54-ubuntu18 \
--tags <tag-name> \
--bucket '<cloud storage bucket>' \
--initialization-actions 'gs://goog-dataproc-initialization-actions-us-east1/connectors/connectors.sh','gs://<bucket-path>/upgrade-pip.sh','gs://goog-dataproc-initialization-actions-us-east1/python/pip-install.sh' \
--metadata='PIP_PACKAGES=wheel datalab xgboost==1.3.3 shap oyaml click apache-airflow apache-airflow-providers-google' \
--initialization-action-timeout 30m \
--metadata gcs-connector-version=2.1.1,bigquery-connector-version=1.1.1,spark-bigquery-connector-version=0.17.2 \
--project <project-name>

It seems you need to upgrade pip; see this question.

But a Dataproc cluster can contain more than one pip, so you need to pick the right one:

  1. For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you chose. The default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.

  2. For custom images, at image creation time, you want to use the explicit full path: /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.
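To see which of these pips actually exist on a node and where the default link points, something like the following could be run over SSH. It is only a sketch: the function name describe_conda_pips and its optional root argument are made up for illustration, and it assumes GNU readlink as found on the Ubuntu images.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report each Conda pip mentioned above.
# The optional first argument overrides the Conda root (default /opt/conda)
# so the function can be tried outside a cluster.
describe_conda_pips() {
  local root="${1:-/opt/conda}"
  local name pip_bin
  for name in default anaconda miniconda3; do
    pip_bin="${root}/${name}/bin/pip"
    if [[ -x "${pip_bin}" ]]; then
      # readlink -f resolves the "default" symlink to its real target
      echo "${name}: $(readlink -f "${root}/${name}") ($("${pip_bin}" --version))"
    else
      echo "${name}: not present"
    fi
  done
}
```

Calling describe_conda_pips with no argument inspects /opt/conda and prints one line per location, which makes it easy to check that the pip being upgraded is the one the install step will use.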

So you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
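As a rough sketch of what an upgrade-pip.sh init action could contain under that advice: the function name upgrade_pip and its optional path argument are my own invention, and the body does nothing beyond running the command above against the Anaconda pip.

```shell
#!/usr/bin/env bash
# Hypothetical upgrade-pip.sh body. The optional first argument overrides
# the pip path; the default is the Anaconda pip named in the answer.
upgrade_pip() {
  local pip_bin="${1:-/opt/conda/anaconda/bin/pip}"
  if [[ ! -x "${pip_bin}" ]]; then
    echo "pip not found at ${pip_bin}" >&2
    return 1
  fi
  # Upgrade pip in place, using that same interpreter's pip
  "${pip_bin}" install --upgrade pip
}
```

The script would end with a plain upgrade_pip call. In the second gcloud command from the question, this corresponds to the gs://<bucket-path>/upgrade-pip.sh entry, which is listed before pip-install.sh in --initialization-actions.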