Keras Google CloudML sample: IndexError
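I am trying out the Keras CloudML sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras), but I cannot get the cloud training to run. Local training, both with python and with gcloud, seems to work fine.
I searched Stack Exchange and Google for a solution and read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong indication that the fault is entirely mine!). In addition to the environment below, I also tried Python 3.6 and TensorFlow 1.3, without success.
I am a newbie, so I have probably made some basic mistake, but I cannot spot it.
All help is appreciated,
:-)
yarc68000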
-- Environment --
(env1) $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(env1) $ conda list | grep 'h5py\|keras\|pandas\|numexpr\|tensorflow'
h5py 2.7.1 py27_1 conda-forge
keras 2.0.6 py27_0 conda-forge
numexpr 2.6.2 py27_1 conda-forge
pandas 0.20.3 py27_0 anaconda
tensorflow 1.2.1 <pip>
(env1) $ gcloud --version
Google Cloud SDK 172.0.1
alpha 2017.09.15
beta 2017.09.15
bq 2.0.26
core 2017.09.21
datalab 20170818
gcloud
gsutil 4.27
------------ Job --------
(env1) $ export BUCKET=gs://j170922census1
(env1) $ gsutil mb $BUCKET
Creating gs://j170922census1/...
(env1) $ export TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv
(env1) $ export EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv
(env1) $ export JOB_NAME="census_keras_$$"
(env1) $ export TRAIN_STEPS=200
(env1) $ gcloud ml-engine jobs submit training $JOB_NAME --stream-logs --runtime-version 1.2 --job-dir $BUCKET --package-path trainer --module-name trainer.task --region us-central1 -- --train-files $TRAIN_FILE --eval-files $EVAL_FILE --train-steps $TRAIN_STEPS
Job [census_keras_7639] submitted successfully.
INFO 2017-09-22 19:56:56 +0200 service Validating job requirements...
INFO 2017-09-22 19:56:57 +0200 service Job creation request has been successfully validated.
INFO 2017-09-22 19:56:57 +0200 service Job census_keras_7639 is queued.
INFO 2017-09-22 19:56:57 +0200 service Waiting for job to be provisioned.
INFO 2017-09-22 20:01:39 +0200 service Waiting for TensorFlow to start.
INFO 2017-09-22 20:02:55 +0200 master-replica-0 Running task with arguments: --cluster={"master": ["master-cc38d44a51-0:2222"]} --task={"type": "master", "index": 0} --job={
<..>
INFO 2017-09-22 20:04:00 +0200 master-replica-0 197/200 [============================>.] - ETA: 0s - loss: 0.6931 - acc: 0.7563
INFO 2017-09-22 20:04:00 +0200 master-replica-0 200/200 [==============================] - 1s - loss: 0.6931 - acc: 0.7600
INFO 2017-09-22 20:04:00 +0200 master-replica-0 Epoch 10/20
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 Traceback (most recent call last):
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 "__main__", fname, loader, pkg_name)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 exec code in run_globals
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 199, in <module>
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 dispatch(**parse_args.__dict__)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 121, in dispatch
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks=callbacks)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/models.py", line 1110, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 initial_epoch=initial_epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1849, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks.on_epoch_begin(epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callback.on_epoch_begin(epoch, logs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 census_model = load_model(checkpoints[-1])
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 IndexError: list index out of range
<..>
INFO 2017-09-22 20:04:53 +0200 service Finished tearing down TensorFlow.
INFO 2017-09-22 20:05:49 +0200 service Job failed.
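Running this on Cloud ML Engine does in fact hit a bug: checkpointing on GCS is currently disabled, because Keras cannot natively write checkpoints to GCS. Please see this PR for the immediate fix for the issue you are facing. Also take a look at the pending PR, which fixes the checkpointing problem and makes the files available on GCS (a workaround for Keras not being able to write to GCS).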
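The traceback ends at `load_model(checkpoints[-1])`. If no checkpoint files were written, that list is presumably empty, which would explain the IndexError. Below is a minimal sketch of a defensive lookup, not the fix from the PRs mentioned above. The helper name `latest_checkpoint` and the file pattern `checkpoint.*.hdf5` are assumptions for illustration, and `glob` only sees local paths, not `gs://` URLs.

```python
import glob
import os


def latest_checkpoint(checkpoint_dir, pattern="checkpoint.*.hdf5"):
    """Return the most recently modified checkpoint path, or None.

    Hypothetical helper: the file pattern is an assumption and may not
    match the names the sample actually uses. Works on local paths only.
    """
    candidates = glob.glob(os.path.join(checkpoint_dir, pattern))
    if not candidates:
        # Nothing has been written yet (for example, checkpointing is
        # disabled), so signal that instead of indexing an empty list.
        return None
    return max(candidates, key=os.path.getmtime)
```

Inside a callback, the result could be checked before loading: call `load_model(path)` only when `path = latest_checkpoint(job_dir)` is not `None`, and skip the evaluation for that epoch otherwise (`job_dir` here stands for whatever local directory the checkpoints are written to).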