How to help condor find the file it should execute in a job?
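I'm trying to run a job, but condor doesn't seem to be able to find my file.
I have made sure that:
- the file is there, by running ls and cat on its absolute path
- it runs from a condor interactive session
- it has the right permissions, so that it can run.
I've done all of that, but I still get this error: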
(automl-meta-learning) miranda9~/automl-meta-learning/automl-proj/experiments/meta_learning $ cat condor_job_log_69.out
000 (069.000.000) 10/21 11:06:06 Job submitted from host: <130.126.112.32:9618?addrs=130.126.112.32-9618+[--1]-9618&noUDP&sock=3715279_f2e6_4>
...
001 (069.000.000) 10/21 11:06:07 Job executing on host: <172.22.224.111:9618?addrs=172.22.224.111-9618+[--1]-9618&noUDP&sock=807_1d04_3>
...
007 (069.000.000) 10/21 11:06:07 Shadow exception!
Error from slot1_3@vision-01.cs.illinois.edu: Failed to execute '/home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py': (errno=2: 'No such file or directory')
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
012 (069.000.000) 10/21 11:06:07 Job was held.
Error from slot1_3@vision-01.cs.illinois.edu: Failed to execute '/home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py': (errno=2: 'No such file or directory')
Code 6 Subcode 2
...
(automl)
But the file is clearly there:
(automl-meta-learning) miranda9~/automl-meta-learning/automl-proj/experiments/meta_learning $ ls -lah /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
-rwxrwxr-x. 1 miranda9 miranda9 22K Oct 20 14:54 /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
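I don't understand why condor can't find it. Any ideas? I'm not a sysadmin, so I don't even know how to start debugging this.
For reference, here is my submit script: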
####################
#
# Experiments script
# Simple HTCondor submit description file
#
# reference: https://gitlab.engr.illinois.edu/Vision/vision-gpu-servers/-/wikis/HTCondor-user-guide#submit-jobs
#
# chmod a+x test_condor.py
# chmod a+x experiments_meta_model_optimization.py
# chmod a+x meta_learning_experiments_submission.py
# chmod a+x download_miniImagenet.py
#
# condor_submit -i
# condor_submit job.sub
#
####################
# Executable = meta_learning_experiments_submission.py
# Executable = automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
# Executable = ~/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
Executable = /home/miranda9/automl-meta-learning/automl-proj/experiments/meta_learning/meta_learning_experiments_submission.py
## Output Files
Log = condor_job.$(CLUSTER).log.out
Output = condor_job.$(CLUSTER).stdout.out
Error = condor_job.$(CLUSTER).err.out
# Use this to make sure 1 gpu is available. The key words are case insensitive.
REquest_gpus = 1
# requirements = ((CUDADeviceName = "Tesla K40m")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.gpus >= Requestgpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
# requirements = (CUDADeviceName == "Tesla K40m")
# requirements = (CUDADeviceName == "Quadro RTX 6000")
requirements = (CUDADeviceName != "Tesla K40m")
# Note: to use multiple CPUs instead of the default (one CPU), use request_cpus as well
Request_cpus = 8
# E-mail option
Notify_user = me@gmail.com
Notification = always
Environment = MY_CONDOR_JOB_ID= $(CLUSTER)
# "Queue" means add the setup until this line to the queue (needs to be at the end of script).
Queue
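It looks like your executable is a Python script. Linux reports "No such file or directory" when the script itself exists but the interpreter named on its "#!" line does not exist on the system. Could that be what is happening here? What does the first line of this script look like?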
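The problem was that the top of my Python submission script still held arguments for a different cluster, unrelated to condor, so the path to the python executable was wrong. I fixed it by removing them and adding this line at the top of my Python submission script: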
#!/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python
To find the python path of your current environment, run:
which python
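The diagnosis above (the script exists, but the interpreter on its "#!" line does not) can be checked before submitting. Below is a minimal sketch of such a check; the helper names `shebang_interpreter` and `interpreter_is_usable` are made up for illustration and are not part of HTCondor. The result only reflects the machine the check runs on, so it is most useful when run on the execute node, for example from a `condor_submit -i` session.

```python
import os


def shebang_interpreter(script_path):
    """Return the interpreter path named on the script's '#!' line, or None.

    Hypothetical helper for illustration; it only inspects the first line.
    """
    with open(script_path, "rb") as f:
        first_line = f.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    # The first token after '#!' is the interpreter; any later tokens are
    # arguments passed to it.
    parts = first_line[2:].split()
    return parts[0] if parts else None


def interpreter_is_usable(script_path):
    """True if the '#!' interpreter exists and is executable on this machine.

    Limitation: for '#!/usr/bin/env python' this only checks /usr/bin/env,
    not whether 'python' can be found on PATH.
    """
    interp = shebang_interpreter(script_path)
    return (
        interp is not None
        and os.path.isfile(interp)
        and os.access(interp, os.X_OK)
    )
```

Calling `interpreter_is_usable` on the path given as `Executable` in the submit file returns `False` when the "#!" line points at an interpreter that is missing on that machine. As an alternative to `which python`, running `python -c "import sys; print(sys.executable)"` inside the activated environment prints the interpreter path to put on the "#!" line.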