Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Question

I am running a Kubeflow pipeline whose code uses TensorFlow 2.0. At the end of each epoch, the following warning is displayed:

W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Also, after some epochs, it stops showing logs and the step fails with this error:

This step is in Failed state with this message: The node was low on resource: memory. Container main was using 100213872Ki, which exceeds its request of 0. Container wait was using 25056Ki, which exceeds its request of 0.

Answer 1

In my case, I installed tf-nightly and it is working now, although I am new to TensorFlow. I followed this link.

You can give it a try.

Answer 2

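I ran into the same problem. People claim that the warning is superfluous and that it has been removed in tf-nightly, see here. But the memory leak still persists on every epoch.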

Answer 3

This is caused by incompatible CUDA and TensorFlow versions. The following versions work well together:

tensorflow-gpu==2.0.0

tensorflow-addons==0.6.0

nvidia/cuda:10.0-cudnn7-runtime

Answer 4

In my case, batch_size and steps_per_epoch did not match.

For example,

his = Test_model.fit_generator(datagen.flow(trainrancrop_images, trainrancrop_labels, batch_size=batchsize),
                               steps_per_epoch=len(trainrancrop_images)/batchsize,
                               validation_data=(test_images, test_labels),
                               epochs=1,
                               callbacks=[callback])

The batch_size passed to datagen.flow must correspond to the steps_per_epoch passed to Test_model.fit_generator (in fact, I was using the wrong value for steps_per_epoch).

I guess this is one of the cases that triggers the error.

So I think the problem appears when the batch size and the number of steps (iterations) do not correspond to each other.

When you get the step count by division, the floating-point result may be a problem...

Check your code for this issue.
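As a minimal sketch of the point about division, the step count can be computed with integer arithmetic so that it is always a whole number. The helper name steps_for_epoch and the sample numbers are hypothetical, and rounding up (so the final, smaller batch is counted) is an assumption about what you want, not something the error message itself requires.

```python
def steps_for_epoch(num_samples: int, batch_size: int) -> int:
    """Return how many batches cover num_samples once, rounding up.

    Hypothetical helper: uses integer arithmetic only, so the result
    is always an int, unlike len(data) / batch_size, which is a float.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (num_samples + batch_size - 1) // batch_size


# Hypothetical numbers for illustration only.
num_samples = 1000
batch_size = 32

print(num_samples / batch_size)                   # 31.25, a float
print(steps_for_epoch(num_samples, batch_size))   # 32, an int
```

The resulting integer could then be passed as steps_per_epoch in place of len(trainrancrop_images)/batchsize.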

Good luck :)

Answer 5

Upgrading TensorFlow from 2.1 to 2.2 fixed the problem for me. I did not have to go to the tf-nightly version.

Answer 6

To work around this issue, you can add workers=1 to model.fit(...).

Answer 7

I tried the following steps and they worked in my case:

conda install tensorflow=2.0.0
conda install -c conda-forge keras=2.3.0
