Keras Google Cloud ML sample: IndexError

I am trying the Keras Cloud ML sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras), but I can't get the cloud training to run. Local training, both with python and with gcloud, seems to work fine.

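I searched Stack Exchange and Google for a solution and read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong indication that the error is entirely mine!). Besides the environment below, I also tried Python 3.6 and TensorFlow 1.3, without success.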

I am a noob, so I have probably made some basic mistake, but I can't spot it.

All help is appreciated,

:-)

yarc68000。

--Environment-

(env1) $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(env1) $ conda list | grep 'h5py\|keras\|pandas\|numexpr\|tensorflow'
h5py                      2.7.1                    py27_1    conda-forge
keras                     2.0.6                    py27_0    conda-forge
numexpr                   2.6.2                    py27_1    conda-forge
pandas                    0.20.3                   py27_0    anaconda
tensorflow                1.2.1                     <pip>
(env1) $ gcloud --version
Google Cloud SDK 172.0.1
alpha 2017.09.15
beta 2017.09.15
bq 2.0.26
core 2017.09.21
datalab 20170818
gcloud 
gsutil 4.27

------------Job--------

(env1) $ export BUCKET=gs://j170922census1
(env1) $ gsutil mb $BUCKET
Creating gs://j170922census1/...
(env1) $ export TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv
(env1) $ export EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv
(env1) $ export JOB_NAME="census_keras_$$"
(env1) $ export TRAIN_STEPS=200
(env1) $ gcloud ml-engine jobs submit training $JOB_NAME --stream-logs --runtime-version 1.2 --job-dir $BUCKET --package-path trainer --module-name trainer.task --region us-central1 -- --train-files $TRAIN_FILE --eval-files $EVAL_FILE --train-steps $TRAIN_STEPS
Job [census_keras_7639] submitted successfully.
INFO    2017-09-22 19:56:56 +0200   service     Validating job requirements...
INFO    2017-09-22 19:56:57 +0200   service     Job creation request has been successfully validated.
INFO    2017-09-22 19:56:57 +0200   service     Job census_keras_7639 is queued.
INFO    2017-09-22 19:56:57 +0200   service     Waiting for job to be provisioned.
INFO    2017-09-22 20:01:39 +0200   service     Waiting for TensorFlow to start.
INFO    2017-09-22 20:02:55 +0200   master-replica-0        Running task with arguments: --cluster={"master": ["master-cc38d44a51-0:2222"]} --task={"type": "master", "index": 0} --job={
<..>
INFO    2017-09-22 20:04:00 +0200   master-replica-0        197/200 [============================>.] - ETA: 0s - loss: 0.6931 - acc: 0.7563
INFO    2017-09-22 20:04:00 +0200   master-replica-0        200/200 [==============================] - 1s - loss: 0.6931 - acc: 0.7600     
INFO    2017-09-22 20:04:00 +0200   master-replica-0        Epoch 10/20
ERROR   2017-09-22 20:04:02 +0200   master-replica-0        Traceback (most recent call last):
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            "__main__", fname, loader, pkg_name)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            exec code in run_globals
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 199, in <module>
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            dispatch(**parse_args.__dict__)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 121, in dispatch
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            callbacks=callbacks)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            return func(*args, **kwargs)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/models.py", line 1110, in fit_generator
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            initial_epoch=initial_epoch)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            return func(*args, **kwargs)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1849, in fit_generator
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            callbacks.on_epoch_begin(epoch)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            callback.on_epoch_begin(epoch, logs)
ERROR   2017-09-22 20:04:02 +0200   master-replica-0          File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in on_epoch_begin
ERROR   2017-09-22 20:04:02 +0200   master-replica-0            census_model = load_model(checkpoints[-1])
ERROR   2017-09-22 20:04:02 +0200   master-replica-0        IndexError: list index out of range
<..>
INFO    2017-09-22 20:04:53 +0200   service     Finished tearing down TensorFlow.
INFO    2017-09-22 20:05:49 +0200   service     Job failed.

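This is actually a bug when running on Cloud ML Engine: checkpoints are currently disabled on GCS, because Keras cannot natively write checkpoints to GCS. See this PR for the immediate fix for the issue you are facing. Also take a look at the pending PR, which fixes the checkpoint issue and makes the files available on GCS (a workaround for Keras not being able to write to GCS).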
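
To make the failure mode concrete: the traceback ends in `load_model(checkpoints[-1])` inside the sample's `on_epoch_begin` callback, and indexing `[-1]` on an empty list raises exactly this `IndexError`. The following is only a minimal sketch of a defensive guard, not the code from either PR. The helper name `latest_checkpoint` and the file pattern `checkpoint.*.hdf5` are assumptions made for illustration, and the sketch assumes the checkpoint list comes from a local `glob` call, which cannot see `gs://` paths.

```python
import glob
import os


def latest_checkpoint(checkpoint_dir, pattern="checkpoint.*.hdf5"):
    """Return the newest matching checkpoint path, or None if none exist.

    Both the function name and the default pattern are hypothetical;
    adjust them to whatever the training script actually writes.
    """
    # glob only inspects the local filesystem, so a gs:// directory
    # (or a directory with no checkpoints yet) gives an empty list.
    matches = glob.glob(os.path.join(checkpoint_dir, pattern))
    if not matches:
        return None
    # Choose by modification time instead of relying on glob's ordering.
    return max(matches, key=os.path.getmtime)


if __name__ == "__main__":
    # Hypothetical bucket path, used only to show the empty-result case.
    path = latest_checkpoint("gs://some-bucket/some-job")
    if path is None:
        print("No checkpoint found, skipping evaluation for this epoch")
    else:
        print("Would load", path)
```

A callback could call such a helper and skip `load_model` when it returns `None`, so that a missing checkpoint no longer aborts the whole job. This only avoids the crash. Getting checkpoints onto GCS still needs the workaround described in the answer.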