GCP - CDAP - Dataproc cluster stuck in running state

Question

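We have a DataFusion pipeline that is triggered by a Cloud Composer DAG. The pipeline provisions an ephemeral DataProc cluster which, ideally, terminates once the job has finished.

In our case, sometimes but not always, this ephemeral DataProc cluster gets stuck in the running state. The job inside the cluster also remains in the running state, and the last log messages are the following: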

INFO runtimejob.DataprocJobMain: Invoking initialize() on io.cdap.cdap.runtime.spi.runtimejob.DataprocRuntimeEnvironment with spark2_2.11
INFO runtimejob.DataprocJobMain: Invoking run() on io.cdap.cdap.internal.app.runtime.distributed.runtimejob.DefaultRuntimeJob
INFO runtimejob.DataprocJobMain: Invoking destroy() on io.cdap.cdap.internal.app.runtime.distributed.runtimejob.DefaultRuntimeJob
INFO runtimejob.DataprocJobMain: Runtime job completed.
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread " STARTING-SendThread(cdap-<our-identifier>-1f11111b-1d11-11eb-b1a1-1a111fb11d11-m.c.<our-gcp-project-name>.internal:41409)"
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread "threadDeathWatcher-2-1"

On the DataFusion side, the pipeline is marked as succeeded. The DataFusion logs are as follows:

Completed DEPROVISION subtask REQUESTING_DELETE for program run program_run: <data_fusion_namespace>.<pipeline_name>.-SNAPSHOT.workflow.DataPipelineWorkflow.<data_proc_id> //this message is repeated many-many times
DEBUG [provisioning-service-4:i.c.c.c.s.Retries@197] - Retries exhausted after 1 failures and 14 ms.

Any idea what could be causing this?

P.S.: The identifiers in the messages have been replaced with random values.

Answer 1

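Which version of Data Fusion are you running? Also, how much memory does the Dataproc cluster have? We have sometimes observed this problem when the Dataproc cluster runs out of memory. I would suggest increasing the amount of memory.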
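
As a rough illustration of how the memory could be raised per run, here is a minimal Python sketch that builds runtime arguments overriding the memory of the Dataproc compute profile. The helper name `dataproc_memory_args` is hypothetical, and the property names `masterMemoryMB` and `workerMemoryMB` under the `system.profile.properties.` prefix are an assumption based on the CDAP Dataproc provisioner settings; please check them against your Data Fusion version before relying on them.

```python
# Hypothetical helper (not part of CDAP or Composer): builds runtime
# arguments that override the memory of the Dataproc compute profile
# for a single pipeline run.
# ASSUMPTION: the property names below follow the CDAP convention
# "system.profile.properties.<provisioner property>"; verify them for
# your CDAP / Data Fusion version.

PROFILE_PREFIX = "system.profile.properties."


def dataproc_memory_args(master_memory_mb: int, worker_memory_mb: int) -> dict:
    """Return runtime arguments requesting the given memory sizes (in MB)."""
    if master_memory_mb <= 0 or worker_memory_mb <= 0:
        raise ValueError("memory sizes must be positive")
    # Runtime argument values are passed as strings.
    return {
        PROFILE_PREFIX + "masterMemoryMB": str(master_memory_mb),
        PROFILE_PREFIX + "workerMemoryMB": str(worker_memory_mb),
    }


if __name__ == "__main__":
    # Example values only; choose sizes that fit your workload.
    print(dataproc_memory_args(8192, 16384))
```

Since the pipeline is started from Cloud Composer, the resulting dictionary could presumably be handed to the operator that starts the pipeline (for example through the `runtime_args` parameter of `CloudDataFusionStartPipelineOperator`), or the same values could be set directly on the compute profile in the Data Fusion UI.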

java

mapreduce

apache-spark

google-cloud-dataproc

cdap
