分布式 Mxnet 培训中的权限被拒绝

Permission denied in distributed Mxnet training

我正在研究 MXNet,这是一个深度学习库。我实现的结构在单机和分布式 CPU 机器中。我关注了 MXNet 官方网站上的 tutorial。 在单机上的实现 运行ning 没有任何问题,我得到了结果。

然后我尝试使用多台 CPU 机器进行分布式训练。 我在 AWS,amazon 虚拟机上创建了一个帐户,并启动了 t2.micro ubuntu 的 3 个。 我输入了以下命令行:

../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync

上面的命令行假设 运行 一个分布式版本训练有 2 个工人和 1 个服务器。

不幸的是,我遇到了一个错误。我知道有一个权限被拒绝,但我尝试通过以下命令从服务器访问其他两个工作人员并且它有效:

ssh -i key.pem ubuntu@ip number.

这里是错误

Permission denied (publickey).
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_WORKER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255

Permission denied (publickey).
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255

[03:48:39] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]

Traceback (most recent call last):
  File "train_mnist.py", line 76, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/home/ubuntu/Research/code/mxnet/example/image-classification/common/fit.py", line 97, in fit
    kv = mx.kvstore.create(args.kv_store)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/kvstore.py", line 403, in create
    ctypes.byref(handle)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/base.py", line 77, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 363, in <lambda>
    target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1

您看到错误是因为您启动作业的节点无权访问其他工作程序。 launch.py 默认使用 ssh 与工作人员通信。您需要设置从本地计算机到节点(主节点)的 ssh 代理转发设置,这些凭据可用于通过 ssh 与集群中的其他节点通信。

on MAC OS,-A 将启用凭据转发

ssh -A -i key.pem ubuntu@ip number.

您可能需要考虑使用 Deep Learning CloudFormation template 创建堆栈并向您展示设置 ssh-agent 转发和 CIFAR-10 示例到 运行 分布式训练的步骤。