[HTCONDOR][kubernetes / k8s]:无法在 k8s 内启动 minicondor 图像 - condor_master 不工作
[HTCONDOR][kubernetes / k8s] : Unable to start minicondor image within k8s - condor_master not working
POST 编辑
问题是由于:
PSP
(Pod security policy
) 默认情况下,我的 condor
用户不允许升级。这就是它不起作用的原因。因为 supervisord
是 运行 作为 root
用户并尝试以 root
而不是其他用户(即 condor
)编写日志并启动 condor 收集器
描述
mini-condor
基础映像未在 kubernetes rancher pod 上按预期启动。
我正在使用:
- 这张图片:https://hub.docker.com/r/htcondor/mini在 rancher (k8s) 的自定义命名空间中
ps : the image was working perfectly on :
- a local env
- minikube default installation
我是运行它作为一个简单的部署:
当 pod 启动时,Kubernetes 默认日志文件显示:
2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
这里是一个简短的命令和输出结果:
CMD
output
condor_status
CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master
ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp`
1)首先尝试解决问题
我决定自定义图片,但是错误还是一样
用于尝试修复权限问题的 docker 图片
- 图片:
FROM htcondor/mini:9.2-el7
RUN condor_master
RUN chown condor:root /var/
RUN chown condor:root /var/log
RUN chown -R condor:root /var/log/
RUN chown -R condor:condor /var/log/condor
RUN chown condor:condor /var/log/condor/ProcLog
RUN chown condor:condor /var/log/condor/MasterLog
RUN chmod 775 -R /var/
- Kubernetes - 牧场主
- yaml 文件:
apiVersion: apps/v1
kind: Deployment
metadata:
name: htcondor-mini--all-in-one
namespace: grafana-exporter
spec:
containers:
- image: <custom_image>
imagePullPolicy: Always
name: htcondor-mini--all-in-one
resources: {}
securityContext:
capabilities: {}
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true
dnsConfig: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
这里是一个简短的命令和输出结果:
CMD
output
condor_status
CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master
ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp`
ls -ld /var/
drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/
ls -ld /var/log/
drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/
ls -ld /var/log/condor
drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor
ls -ld /var/log/condor/MasterLog
-rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog
主日志内容:
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
10/07/21 11:23:21 ** /usr/sbin/condor_master
10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/07/21 11:23:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
10/07/21 11:23:21 ** PID = 7
10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 Using config source: /etc/condor/condor_config
10/07/21 11:23:21 Using local config sources:
10/07/21 11:23:21 /etc/condor/config.d/00-htcondor-9.0.config
10/07/21 11:23:21 /etc/condor/config.d/00-minicondor
10/07/21 11:23:21 /etc/condor/config.d/01-misc.conf
10/07/21 11:23:21 /etc/condor/condor_config.local
10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
10/07/21 11:23:21 CLASSAD_CACHING is OFF
10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
10/07/21 11:23:23 Found /var/log/condor/.collector_address.
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started
非常感谢阅读。希望它能帮助很多其他人。
问题原因
问题是由于:
PSP policy
(Pod 安全策略)
默认情况下,我的 condor 用户不允许升级。
解决方案
我目前找到的最佳解决方案是 运行 以 condor 用户身份进行所有操作,并向 condor 用户授予权限。为此,您需要:
- 在
supervisord.conf
中:运行 主管作为 condor
用户
- 在
supervisord.conf
中:运行 日志和套接字在 /tmp
- 在
Dockerfile
中:通过 condor
更改大部分文件夹的所有者
- 在
deployment.yaml
中将ID
设置为64
(神鹰用户)
Dockerfile
FROM htcondor/mini:9.2-el7
# SET WORKDIR
WORKDIR /home/condor/
RUN chown condor:condor /home/condor
# COPY SUPERVISOR
COPY supervisord.conf /etc/supervisord.conf
# Need to run the cmd to create all dir
RUN condor_master
# FIX PERMISSION ISSUES FOR RANCHER
RUN chown -R condor:condor /var/log/ /tmp &&\
chown -R restd:restd /home/restd &&\
chmod 755 -R /home/restd
supervisord.conf
:
[supervisord]
user=condor
nodaemon=true
logfile = /tmp/supervisord.log
directory = /tmp
pidfile = /tmp/supervisord.pid
childlogdir = /tmp
# next 3 sections contain using supervisorctl to manage daemons
[unix_http_server]
file=/tmp/supervisord.sock
chown=condor:condor
chmod=0777
user=condor
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock
[program:condor_master]
user=condor
command=/usr/sbin/condor_master -f
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile = /var/log/condor_master.log
stderr_logfile = /var/log/condor_master.error.log
deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
containers:
- image: <condor-image>
imagePullPolicy: Always
name: htcondor-exporter
ports:
- containerPort: 8080
name: myport
protocol: TCP
resources: {}
securityContext:
capabilities: {}
runAsNonRoot: false
runAsUser: 64
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true
POST 编辑
问题是由于:
PSP
(Pod security policy
) 默认情况下,我的 condor
用户不允许升级。这就是它不起作用的原因。因为 supervisord
是 运行 作为 root
用户并尝试以 root
而不是其他用户(即 condor
)编写日志并启动 condor 收集器
描述
mini-condor
基础映像未在 kubernetes rancher pod 上按预期启动。
我正在使用:
- 这张图片:https://hub.docker.com/r/htcondor/mini在 rancher (k8s) 的自定义命名空间中
ps : the image was working perfectly on :
- a local env
- minikube default installation
我是运行它作为一个简单的部署:
当 pod 启动时,Kubernetes 默认日志文件显示:
2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
这里是一个简短的命令和输出结果:
CMD | output |
---|---|
condor_status |
CEDAR:6001:Failed to connect to <127.0.0.1:9618> |
condor_master |
ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp` |
1)首先尝试解决问题
我决定自定义图片,但是错误还是一样
用于尝试修复权限问题的 docker 图片
- 图片:
FROM htcondor/mini:9.2-el7
RUN condor_master
RUN chown condor:root /var/
RUN chown condor:root /var/log
RUN chown -R condor:root /var/log/
RUN chown -R condor:condor /var/log/condor
RUN chown condor:condor /var/log/condor/ProcLog
RUN chown condor:condor /var/log/condor/MasterLog
RUN chmod 775 -R /var/
- Kubernetes - 牧场主
- yaml 文件:
apiVersion: apps/v1
kind: Deployment
metadata:
name: htcondor-mini--all-in-one
namespace: grafana-exporter
spec:
containers:
- image: <custom_image>
imagePullPolicy: Always
name: htcondor-mini--all-in-one
resources: {}
securityContext:
capabilities: {}
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true
dnsConfig: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
这里是一个简短的命令和输出结果:
CMD | output |
---|---|
condor_status |
CEDAR:6001:Failed to connect to <127.0.0.1:9618> |
condor_master |
ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp` |
ls -ld /var/ |
drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/ |
ls -ld /var/log/ |
drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/ |
ls -ld /var/log/condor |
drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor |
ls -ld /var/log/condor/MasterLog |
-rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog |
主日志内容:
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
10/07/21 11:23:21 ** /usr/sbin/condor_master
10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/07/21 11:23:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
10/07/21 11:23:21 ** PID = 7
10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 Using config source: /etc/condor/condor_config
10/07/21 11:23:21 Using local config sources:
10/07/21 11:23:21 /etc/condor/config.d/00-htcondor-9.0.config
10/07/21 11:23:21 /etc/condor/config.d/00-minicondor
10/07/21 11:23:21 /etc/condor/config.d/01-misc.conf
10/07/21 11:23:21 /etc/condor/condor_config.local
10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
10/07/21 11:23:21 CLASSAD_CACHING is OFF
10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
10/07/21 11:23:23 Found /var/log/condor/.collector_address.
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started
非常感谢阅读。希望它能帮助很多其他人。
问题原因
问题是由于:
PSP policy
(Pod 安全策略)
默认情况下,我的 condor 用户不允许升级。
解决方案
我目前找到的最佳解决方案是 运行 以 condor 用户身份进行所有操作,并向 condor 用户授予权限。为此,您需要:
- 在
supervisord.conf
中:运行 主管作为condor
用户 - 在
supervisord.conf
中:运行 日志和套接字在/tmp
- 在
Dockerfile
中:通过condor
更改大部分文件夹的所有者
- 在
deployment.yaml
中将ID
设置为64
(神鹰用户)
Dockerfile
FROM htcondor/mini:9.2-el7
# SET WORKDIR
WORKDIR /home/condor/
RUN chown condor:condor /home/condor
# COPY SUPERVISOR
COPY supervisord.conf /etc/supervisord.conf
# Need to run the cmd to create all dir
RUN condor_master
# FIX PERMISSION ISSUES FOR RANCHER
RUN chown -R condor:condor /var/log/ /tmp &&\
chown -R restd:restd /home/restd &&\
chmod 755 -R /home/restd
supervisord.conf
:
[supervisord]
user=condor
nodaemon=true
logfile = /tmp/supervisord.log
directory = /tmp
pidfile = /tmp/supervisord.pid
childlogdir = /tmp
# next 3 sections contain using supervisorctl to manage daemons
[unix_http_server]
file=/tmp/supervisord.sock
chown=condor:condor
chmod=0777
user=condor
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock
[program:condor_master]
user=condor
command=/usr/sbin/condor_master -f
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile = /var/log/condor_master.log
stderr_logfile = /var/log/condor_master.error.log
deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
containers:
- image: <condor-image>
imagePullPolicy: Always
name: htcondor-exporter
ports:
- containerPort: 8080
name: myport
protocol: TCP
resources: {}
securityContext:
capabilities: {}
runAsNonRoot: false
runAsUser: 64
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true