Cannot start TensorBoard while TensorFlow is running

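I cannot start a TensorBoard instance while TensorFlow is already running and using the GPU. The error is shown below. Apparently TensorFlow reserves all GPU memory at startup, regardless of how much it actually needs. Is there a way to start TensorBoard while a TensorFlow process is running, or does TensorBoard always have to be started first?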

totalMemory: 5,93GiB freeMemory: 41,56MiB
2018-06-02 15:28:11.053634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-02 15:28:11.321850: E tensorflow/core/common_runtime/direct_session.cc:154] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
Traceback (most recent call last):
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/bin/tensorboard", line 11, in <module>
    sys.exit(run_main())
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/main.py", line 36, in run_main
    tf.app.run(main)
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/main.py", line 45, in main
    default.get_assets_zip_provider())
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/program.py", line 166, in main
    tb = create_tb_app(plugins, assets_zip_provider)
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/program.py", line 201, in create_tb_app
    flags=FLAGS)
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/backend/application.py", line 126, in standard_tensorboard_wsgi
    plugin_instances = [constructor(context) for constructor in plugins]
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/backend/application.py", line 126, in <listcomp>
    plugin_instances = [constructor(context) for constructor in plugins]
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/plugins/beholder/beholder_plugin.py", line 47, in __init__
    self.most_recent_frame = im_util.get_image_relative_to_script('no-data.png')
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/plugins/beholder/im_util.py", line 254, in get_image_relative_to_script
    return read_image(filename)
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/plugins/beholder/im_util.py", line 242, in read_image
    return np.array(decode_png(image_file.read()))
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/plugins/beholder/im_util.py", line 159, in __call__
    self._lazily_initialize()
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorboard/plugins/beholder/im_util.py", line 137, in _lazily_initialize
    self._session = tf.Session(graph=graph, config=config)
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1560, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/pascalwhoop/Documents/Code/University/powerTAC/python-agent/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 633, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

TensorBoard 1.7.0 appears to take up roughly 150 MB on the GPU. See this open Tensorboard issue. It looks like a fix is in progress.

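In the meantime, one option is to limit the fraction of GPU memory that TensorFlow is allowed to preallocate per process. This way you can make sure a share of the memory stays free for other tasks that may run on the GPU during training, such as TensorBoard.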

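import tensorflow as tf  # TensorFlow 1.x API

# Let this process claim at most 80% of the GPU memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))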
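
A different workaround, not part of the answer above, is to keep TensorBoard off the GPU altogether by hiding the CUDA devices from its process. The sketch below assumes that `tensorboard` is on your PATH and that an empty `CUDA_VISIBLE_DEVICES` is enough to make the TensorFlow session inside TensorBoard fall back to the CPU. The helper names `tensorboard_cpu_env` and `launch_tensorboard` are made up for illustration.

```python
import os
import subprocess


def tensorboard_cpu_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA in the child process.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def launch_tensorboard(logdir, port=6006):
    """Start TensorBoard as a child process that cannot see the GPU (hypothetical helper)."""
    cmd = ["tensorboard", "--logdir", logdir, "--port", str(port)]
    return subprocess.Popen(cmd, env=tensorboard_cpu_env())
```

Calling `launch_tensorboard("./logs")` (with `./logs` as a placeholder log directory) should behave like running `CUDA_VISIBLE_DEVICES="" tensorboard --logdir ./logs` in a shell. The training process keeps its GPU memory and TensorBoard never tries to allocate any.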