Systemd 服务处于非活动状态(已死),但仅在数周后

Systemd service is inactive (dead), but only after many weeks

我有一个自定义的 systemd 服务,它使用 inotify 扫描文件系统并在某些事件发生时创建文件。

该服务可以正常工作很多天,有时甚至几个星期。然后突然停止了。它配置为使用 Restart=always,因此我希望该服务能够在失败时自行恢复,但这并没有发生。

我想知道如何确定服务无法自行恢复的原因以及如何解决该问题。

这是服务配置:

[Unit]
Description=Sets a PID limit (pids.max) for each container in the docker host
After=docker.service
Wants=docker.service

[Service]
Type=simple
Restart=always
StartLimitInterval=0
RestartSec=5
ExecStart=/opt/scripts/container-pid-limit.sh
StandardError=journal

以及文件内容/opt/scripts/container-pid-limit.sh

#!/bin/bash -x
MAX_PIDS=5000
CGROUPS_DIR=/sys/fs/cgroup/pids/docker/
CONTAINERS_DIR=/srv/docker_root/containers/

set_limit() {
  limit=$(grep -ir label $CONTAINERS_DIR//config.v2.json | jq -r '.Config.Labels["com.xyz.pid_limit"]')

  if [[ ! $limit -gt 0 ]] ; then
    limit=$MAX_PIDS
  fi

  echo "CONTAINER: $c LIMIT $limit FILE $f"
  echo $limit > $f;
}

# set pids.max for already created containers
for f in $(find $CGROUPS_DIR -mindepth 2 -name pids.max); do
  c=$(dirname $f | xargs basename)
  set_limit $c
done

# monitor cgroup dir for newly created dirs
inotifywait --event create,isdir --monitor --quiet --format "%w%f" $CGROUPS_DIR | while read -r line; do
  c=$(basename $line)
  set_limit $c
done

失败前 systemctl status 的示例输出:

● container-pid-limit.service - Sets a PID limit (pids.max) for each container in the docker host
   Loaded: loaded (/etc/systemd/system/container-pid-limit.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2019-06-05 08:44:38 UTC; 14min ago
 Main PID: 277527 (container-pid-l)
    Tasks: 3
   Memory: 2.3M
      CPU: 79ms
   CGroup: /system.slice/container-pid-limit.service
           ├─277527 /bin/bash /opt/scripts/container-pid-limit.sh
           ├─277892 inotifywait --event create,isdir --monitor --quiet --format %w%f /sys/fs/cgroup/pids/docker/
           └─277893 /bin/bash /opt/scripts/container-pid-limit.sh

失败后 systemctl status 的示例输出:

● container-pid-limit.service - Sets a PID limit (pids.max) for each container in the docker host
   Loaded: loaded (/etc/systemd/system/container-pid-limit.service; static; vendor preset: enabled)
   Active: inactive (dead)

编辑:我正在尝试使用 systemctl statussystemctl show 来确定服务何时启动并最终停止,但在我看来,当服务失败时,所有历史记录都将丢失:

参考: https://unix.stackexchange.com/questions/368767/how-do-i-see-when-a-systemd-service-was-started-stopped-restarted

systemctl show 的示例输出:

Type=simple
Restart=always
NotifyAccess=none
RestartUSec=5s
TimeoutStartUSec=1min
TimeoutStopUSec=45s
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
FailureAction=none
PermissionsStartOnly=no
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
MainPID=0
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ExecMainStartTimestampMonotonic=0
ExecMainExitTimestampMonotonic=0
ExecMainPID=0
ExecMainCode=0
ExecMainStatus=0
ExecStart={ path=/opt/scripts/container-pid-limit.sh ; argv[]=/opt/scripts//container-pid-limit.sh ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
Slice=system.slice
MemoryCurrent=18446744073709551615
CPUUsageNSec=18446744073709551615
TasksCurrent=18446744073709551615
Delegate=no
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
BlockIOAccounting=no
BlockIOWeight=18446744073709551615
StartupBlockIOWeight=18446744073709551615
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615
UMask=0022
LimitCPU=18446744073709551615
LimitCPUSoft=18446744073709551615
LimitFSIZE=18446744073709551615
LimitFSIZESoft=18446744073709551615
LimitDATA=18446744073709551615
LimitDATASoft=18446744073709551615
LimitSTACK=18446744073709551615
LimitSTACKSoft=8388608
LimitCORE=18446744073709551615
LimitCORESoft=0
LimitRSS=18446744073709551615
LimitRSSSoft=18446744073709551615
LimitNOFILE=4096
LimitNOFILESoft=1024
LimitAS=18446744073709551615
LimitASSoft=18446744073709551615
LimitNPROC=7869937
LimitNPROCSoft=7869937
LimitMEMLOCK=65536
LimitMEMLOCKSoft=65536
LimitLOCKS=18446744073709551615
LimitLOCKSSoft=18446744073709551615
LimitSIGPENDING=7869937
LimitSIGPENDINGSoft=7869937
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=18446744073709551615
LimitRTTIMESoft=18446744073709551615
OOMScoreAdjust=0
Nice=0
IOScheduling=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardOutput=journal
StandardError=journal
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SyslogLevel=6
SyslogFacility=3
SecureBits=0
CapabilityBoundingSet=18446744073709551615
AmbientCapabilities=0
MountFlags=0
PrivateTmp=no
PrivateNetwork=no
PrivateDevices=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
UtmpMode=init
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
RuntimeDirectoryMode=0755
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=container-pid-limit.service
Names=container-pid-limit.service
Requires=sysinit.target system.slice
Wants=docker.service
Conflicts=shutdown.target
Before=shutdown.target
After=basic.target systemd-journald.socket system.slice docker.service sysinit.target
Description=Sets a PID limit (pids.max) for each container in the docker host
LoadState=loaded
ActiveState=inactive
SubState=dead
FragmentPath=/etc/systemd/system/container-pid-limit.service
UnitFileState=static
UnitFilePreset=enabled
StateChangeTimestampMonotonic=0
InactiveExitTimestampMonotonic=0
ActiveEnterTimestampMonotonic=0
ActiveExitTimestampMonotonic=0
InactiveEnterTimestampMonotonic=0
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
NeedDaemonReload=no
JobTimeoutUSec=infinity
JobTimeoutAction=none
ConditionResult=no
AssertResult=no
ConditionTimestampMonotonic=0
AssertTimestampMonotonic=0
Transient=no
StartLimitInterval=0
StartLimitBurst=5
StartLimitAction=none

SystemD Restart-Always 并不是一个循环。大约 failure-handling.

StartLimitBurst 上的 RTFM