MobilenetSSDv2 freezes during transfer learning
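I am training a model with Mobilenet-SSD-v2. It trains for a while, then attempts an evaluation and hangs.
I am running tensorflow-gpu 1.14 in the tensorflow/tensorflow:latest-gpu Docker image, on Ubuntu 19.04 with an RTX 2060. I am using the latest Object Detection API from this git repository: https://github.com/tensorflow/models.
I tried setting throttle_secs in model_lib.py, but it had no effect. I can still train, but every time it tries to evaluate, I have to restart the Docker container.
I am only using the code provided in the git repository. I start training with the command below.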
PIPELINE_CONFIG_PATH=/tensorflow/models/research/face/pipeline.config
MODEL_DIR=/tensorflow/models/research/face/training/
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
I expected it to continue training. Instead it just hangs and I have to restart.
I1002 18:28:30.106040 139663203059520 evaluation.py:255] Starting evaluation at 2019-10-02T18:28:30Z
I1002 18:28:30.717183 139663203059520 monitored_session.py:240] Graph was finalized.
2019-10-02 18:28:30.717937: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.718182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:0a:00.0
2019-10-02 18:28:30.718232: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-02 18:28:30.718251: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-02 18:28:30.718263: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-02 18:28:30.718279: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-02 18:28:30.718295: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-02 18:28:30.718309: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-02 18:28:30.718326: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-02 18:28:30.718401: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.718655: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.718861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-02 18:28:30.718888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 18:28:30.718898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-10-02 18:28:30.718907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-10-02 18:28:30.718992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.719242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 18:28:30.719460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4946 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:0a:00.0, compute capability: 7.5)
I1002 18:28:30.720419 139663203059520 saver.py:1280] Restoring parameters from /tensorflow/models/research/face/training/model.ckpt-10756
I1002 18:28:32.285661 139663203059520 session_manager.py:500] Running local_init_op.
I1002 18:28:32.408489 139663203059520 session_manager.py:502] Done running local_init_op.
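To pin down where a run like this stalls, it can help to pull the last Python-side record out of a saved log. The following is a minimal sketch, assuming the stderr output was captured as text; the helper name `last_absl_record` is hypothetical, and the regular expression only covers the absl-style prefix visible in the log above (`I1002 18:28:32.408489 <thread id> file.py:line] message`), not the C++ runtime lines that start with a full date.

```python
import re

# Matches absl-style records such as:
# I1002 18:28:32.408489 139663203059520 session_manager.py:502] Done running local_init_op.
ABSL_RECORD = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>\d+) "
    r"(?P<source>[^\s:]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def last_absl_record(log_text):
    """Return the final absl-style record in log_text as a dict, or None."""
    last = None
    for raw in log_text.splitlines():
        match = ABSL_RECORD.match(raw.strip())
        if match:
            last = match.groupdict()
    return last


# Three records copied from the log in the question.
SAMPLE = """\
I1002 18:28:30.720419 139663203059520 saver.py:1280] Restoring parameters from /tensorflow/models/research/face/training/model.ckpt-10756
I1002 18:28:32.285661 139663203059520 session_manager.py:500] Running local_init_op.
I1002 18:28:32.408489 139663203059520 session_manager.py:502] Done running local_init_op.
"""

record = last_absl_record(SAMPLE)
print(record["source"], record["line"], record["message"])
```

For the log shown here, the last record comes from session_manager.py, which suggests the hang happens after the checkpoint is restored and local initialization finishes, that is, once the evaluation loop itself starts.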
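I ran into the same problem 6-7 months ago and could not find a solution. However, I tried creating a new environment from scratch. The details of my working environment are listed below.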
# Name            Version     Build           Channel
absl-py           0.8.0       pypi_0          pypi
astor             0.8.0       pypi_0          pypi
bleach            1.5.0       pypi_0          pypi
certifi           2018.8.24   py35_1          anaconda
contextlib2       0.5.5       pypi_0          pypi
cycler            0.10.0      pypi_0          pypi
cython            0.29.13     pypi_0          pypi
gast              0.3.2       pypi_0          pypi
grpcio            1.23.0      pypi_0          pypi
html5lib          0.9999999   pypi_0          pypi
kiwisolver        1.1.0       pypi_0          pypi
libprotobuf       3.6.0       h1a1b453_0      anaconda
lxml              4.4.1       pypi_0          pypi
markdown          3.1.1       pypi_0          pypi
matplotlib        3.0.3       pypi_0          pypi
numpy             1.17.2      pypi_0          pypi
opencv-python     4.1.1.26    pypi_0          pypi
pandas            0.25.1      pypi_0          pypi
pillow            6.1.0       pypi_0          pypi
pip               19.2.3      pypi_0          pypi
protobuf          3.9.1       pypi_0          pypi
pyparsing         2.4.2       pypi_0          pypi
python            3.5.6       he025d50_0
python-dateutil   2.8.0       pypi_0          pypi
pytz              2019.2      pypi_0          pypi
setuptools        41.2.0      pypi_0          pypi
six               1.12.0      pypi_0          pypi
tensorboard       1.8.0       pypi_0          pypi
tensorflow-gpu    1.8.0       pypi_0          pypi
termcolor         1.1.0       pypi_0          pypi
vc                14.1        h21ff451_3      anaconda
vs2015_runtime    15.5.2      3               anaconda
werkzeug          0.15.6      pypi_0          pypi
wheel             0.33.6      pypi_0          pypi
wincertstore      0.2         py35hfebbdb8_0
zlib              1.2.11      h62dcd97_3      anaconda
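Notes:
As you can see, my Python version is 3.5; this is my home computer. My work computer has exactly the same packages with Python 3.6.8, so this also works on 3.6.
Also, I believe tensorflow/models can be used with earlier versions of TensorFlow; as you can see, my version is 1.8.0. When I ran into the same problem, I was using 1.13.
I hope this resolves it.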
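One way to compare an environment that hangs against the working one listed above is to diff the two package listings. This is only a sketch, assuming both listings are available as `conda list`-style text; the helper names `parse_conda_list` and `diff_versions` are hypothetical, and the second listing below is made-up sample data (only the tensorflow-gpu 1.14 release line comes from the question).

```python
def parse_conda_list(text):
    """Map package name -> version from `conda list`-style text."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "# Name Version Build Channel" header.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        versions[fields[0]] = fields[1]
    return versions


def diff_versions(expected, actual):
    """Return {name: (expected, actual_or_None)} for packages that differ."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# A few rows from the working environment listed in this answer.
WORKING = parse_conda_list("""\
# Name Version Build Channel
numpy 1.17.2 pypi_0 pypi
protobuf 3.9.1 pypi_0 pypi
tensorflow-gpu 1.8.0 pypi_0 pypi
""")

# Hypothetical listing standing in for the environment that hangs.
CURRENT = parse_conda_list("""\
numpy 1.17.2 pypi_0 pypi
tensorflow-gpu 1.14.0 pypi_0 pypi
""")

mismatches = diff_versions(WORKING, CURRENT)
for name, (want, have) in sorted(mismatches.items()):
    print(name, "working:", want, "current:", have)
```

Packages missing from the second listing show up with `None` as their current version, so both absent and mismatched packages are reported in one pass.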