Bash/AWS CLI: Trying to figure out how to validate an array of uptimes with 2 checks when rebooting groups of EC2 instances through AWS SSM

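I've been struggling to find the best way to approach a problem in a bash script. I have a command that checks the uptime, in minutes, of groups of servers. I only want to continue to the next group of reboots once every server has been up for 5 minutes, but I also want to verify that none of them has been up for more than an hour, in case the reboot didn't take.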

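I originally tried to set up a while loop that keeps issuing the command to check uptimes and sends the output into an array. I'm trying to figure out how to keep looping until every element of that array is greater than 5 and less than 60. I haven't even succeeded with the first check, greater than 5. Is it possible to keep writing to an array and run an arithmetic check against every value in it, so that the while loop requires all values to be greater than X? The number of servers reporting their current uptime varies from group to group, so the array won't always hold the same number of values.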

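Is an array even the right way to do this? I would post examples of what I've tried so far, but it's a huge mess, and I think starting from scratch and simply asking for input is the best place to begin.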

The output of the command I am running to pull uptimes looks similar to the following:

1
2
1
4
3
2
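
For reference, the "every value must be in range" check described above can be sketched as a small function. This is only a minimal sketch, assuming the uptimes arrive as whole minutes, one per line as in the sample output; `all_in_range` and `get_uptimes` are hypothetical names, not part of any existing tool.

```bash
#!/usr/bin/env bash

# all_in_range MIN MAX VALUE...
# Succeeds only if at least one value is given and every value is a
# non-negative integer with MIN < value < MAX.
all_in_range() {
    local min="$1" max="$2" value
    shift 2
    (( $# > 0 )) || return 1                       # an empty list never passes
    for value in "$@"; do
        [[ "$value" =~ ^[0-9]+$ ]] || return 1     # reject blanks and non-numbers
        # 10# forces base 10 so a value such as 08 is not read as octal
        (( 10#$value > min && 10#$value < max )) || return 1
    done
}

# Hypothetical usage; get_uptimes stands in for the real uptime command:
# mapfile -t uptimes < <(get_uptimes)
# until all_in_range 5 60 "${uptimes[@]}"; do
#     sleep 30
#     mapfile -t uptimes < <(get_uptimes)
# done
```

Because the function takes the values as arguments, it does not care how many servers are in the group.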

EDIT

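Thanks to the help provided, I was able to put together a functional proof of concept, which I'm excited about. I'm posting it in case it helps anyone trying to do something similar in the future. The problem at hand is that we use AWS SSM for all of our Windows server patching, and very often, when SSM tells a server to reboot after patching, the SSM agent takes a long time to check back in. That slows down our whole process, which is currently fairly manual across dozens of patch groups. Many times we have to manually verify that a server really did reboot after SSM told it to, so that we know we can start the reboots for the next patch group. With this, we will be able to run a single script that reboots our patch groups in the proper order and verifies that the servers have rebooted correctly before moving on to the next group.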

#!/bin/bash

### The purpose of this script is to automate the execution of commands required to reboot groups of AWS Windows servers utilizing SSM while also verifying their uptime and only continuing on to the next group once the previous has reached X # of minutes. This solves the problems of AWS SSM Agents not properly checking in with SSM post-reboot.

patchGroups=(01 02 03)                      # array containing the values of the RebootGroup tag


for group in "${patchGroups[@]}"
do
    printf "Rebooting Patch Group %q\n" "$group"
    # $(...) is left unquoted on purpose so the instance IDs split into separate arguments
    aws ec2 reboot-instances --instance-ids $(aws ec2 describe-instances --filters "Name=tag:RebootGroup,Values=$group" --query 'Reservations[].Instances[].InstanceId' --output text)

    sleep 2m

    unset      passed failed serverList                      # wipe arrays and the server list
    declare -A passed failed                                 # declare associative arrays (serverList stays a plain string)

    serverList=$(aws ec2 describe-instances --filter "Name=tag:RebootGroup,Values=$group" --query 'Reservations[*].Instances[*].[InstanceId]' --output text)

    for server in ${serverList}                  # loop through list of servers
    do
        failed["${server}"]=0                     # add to the failed[] array
    done

    while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
    do
        for server in "${!failed[@]}"             # loop through servers in the failed[] array
        do
            ssmID=$(aws ssm send-command --document-name "AWS-RunPowerShellScript" --document-version "1" --targets "[{\"Key\":\"InstanceIds\",\"Values\":[\"$server\"]}]" --parameters '{"commands":["$wmi = Get-WmiObject -Class Win32_OperatingSystem ","$uptimeMinutes =    ($wmi.ConvertToDateTime($wmi.LocalDateTime)-$wmi.ConvertToDateTime($wmi.LastBootUpTime) | select-object -expandproperty \"TotalMinutes\")","[int]$uptimeMinutes"],"workingDirectory":[""],"executionTimeout":["3600"]}' --timeout-seconds 600 --max-concurrency    "50" --max-errors "0" --region us-west-2 --output text --query "Command.CommandId")

            sleep 5

            uptime=$(aws ssm list-command-invocations --command-id "$ssmID" --details --query 'CommandInvocations[].CommandPlugins[].Output' --output text | sed 's/\r$//')

            printf "Checking instance ID %q\n" "$server"
            printf "Value of uptime is = %q\n" "$uptime"

            # if uptime is within our 'success' window then move server to passed[] array

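            if [[ "${uptime}" -ge 3 && "${uptime}" -lt 60 ]]
            then
                passed["${server}"]="${uptime}"   # add to passed[] array
                printf "Server with instance ID %q has successfully rebooted.\n" "$server"
                unset "failed[${server}]"         # remove from failed[] array (quoted so the subscript is not globbed)
            else
                failed["${server}"]="${uptime}"   # record the latest uptime for the status report below
            fi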
        done

        # display current status (edit/remove as desired)

        printf "\n++++++++++++++ successful reboots\n"
        printf "%s\n" "${!passed[@]}" | sort -n

        printf "\n++++++++++++++ failed reboot\n"

        for server in "${!failed[@]}"
        do
            printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
        done | sort -n

        printf "\n"

        sleep 60                            # adjust as necessary
    done
done

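It sounds like you need to keep re-evaluating the uptime output to get the data you want, so an array or other variable may just get in your way. Think about this functionally (as in functions). You need a function that checks whether the uptime is within the bounds you want, just once. Then you need to run that function periodically. If it succeeds, you trigger the reboot. If it fails, you let it try again later.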

Consider this code:

uptime_in_bounds() {
    local min="$1"                 # lower bound in seconds (exclusive)
    local max="$2"                 # upper bound in seconds (exclusive)
    local uptime_float uptime_secs

    # The first value in /proc/uptime is the number of seconds the
    # system has been up. We have to truncate it to an integer…
    read -r uptime_float _ < /proc/uptime
    uptime_secs="${uptime_float%.*}"

    # A shell function reflects the exit status of its last command.
    # This function "succeeds" if the uptime_secs is between min and max.
    (( min < uptime_secs && max > uptime_secs ))
}
if uptime_in_bounds 300 3600; then
    sudo reboot  # or whatever
fi
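
The "run that function periodically" part could look something like the following. This is a hedged sketch rather than part of the answer above: `retry_until` is a hypothetical helper name, and the attempt count and delay are arbitrary values to adjust.

```bash
#!/usr/bin/env bash

# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
    local max_attempts="$1" delay="$2" attempt
    shift 2
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        if "$@"; then
            return 0
        fi
        # Skip the sleep after the final failed attempt.
        if (( attempt < max_attempts )); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical usage with the function defined earlier:
# retry_until 20 60 uptime_in_bounds 300 3600 && echo "uptime is in bounds"
```

Capping the number of attempts keeps the script from waiting forever on a server that never comes back.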

General idea ... it may need some tweaking depending on how OP tracks servers, obtains uptimes, etc.:

# for a given set of servers, and assuming stored in variable ${server_list} ...

unset      passed failed                      # wipe arrays
declare -A passed failed                      # declare associative arrays

for server in ${server_list}                  # loop through list of servers
do
    failed["${server}"]=0                     # add to the failed[] array
done

while [[ "${#failed[@]}" -gt 0 ]]             # loop while number of servers in the failed[] array is greater than 0
do
    for server in "${!failed[@]}"             # loop through servers in the failed[] array
    do
        uptime=$( some_command_to_get_uptime_for_server "${server}" )

        # if uptime is within our 'success' window then move server to passed[] array

        if [[ "${uptime}" -gt 5 && "${uptime}" -lt 60 ]] 
        then
            passed["${server}"]="${uptime}"   # add to passed[] array
            unset "failed[${server}]"         # remove from failed[] array (quoted so the subscript is not globbed)
        else
            failed["${server}"]="${uptime}"
        fi
    done

    # display current status (edit/remove as desired)

    printf "\n++++++++++++++ successful reboots\n"
    printf "%s\n" "${!passed[@]}" | sort -n

    printf "\n++++++++++++++ failed reboot\n"

    for server in "${!failed[@]}"
    do
        printf "%s - %s (mins)\n" "${server}" "${failed[${server}]}"
    done | sort -n

    printf "\n"

    sleep 30                            # adjust as necessary
done

Notes:

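  • This code would likely be part of a larger looping construct based on sets of servers (i.e., a new ${server_list})
  • If the server list is in another format (e.g., a file, another array, etc.), the for loop will need to be modified to populate the failed[] array properly
  • OP will need to edit this to add the code for finding the uptime of a given ${server}
  • OP is (obviously) free to rename variables/arrays as desired
  • OP will probably need to decide what to do if the while loop runs for 'too long'
  • If the new ${uptime} is not within the 5-60 minute range, OP can extend the else block to perform some other operation for the problematic ${server}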
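
One possible way to handle the 'too long' case from the notes is to give the polling loop a deadline. The sketch below assumes a hypothetical helper `get_uptime_minutes` that prints a server's uptime in whole minutes (it stands in for the real uptime command); `wait_for_servers` and its arguments are likewise illustrative names.

```bash
#!/usr/bin/env bash

# wait_for_servers MAX_WAIT_SECONDS POLL_SECONDS SERVER...
# Polls each server until its uptime is inside the 5-60 minute window.
# Returns 0 once every server has passed. On timeout it reports the
# servers still pending on stderr and returns 1.
# get_uptime_minutes is a hypothetical helper the caller must define.
wait_for_servers() {
    local max_wait="$1" poll="$2" server uptime
    shift 2
    local -A pending=()
    for server in "$@"; do
        pending["$server"]=1
    done

    local deadline=$(( SECONDS + max_wait ))
    while (( ${#pending[@]} > 0 )); do
        for server in "${!pending[@]}"; do
            uptime=$(get_uptime_minutes "$server") || uptime=""
            # Treat empty or non-numeric output as "not ready yet".
            if [[ "$uptime" =~ ^[0-9]+$ ]] && (( 10#$uptime > 5 && 10#$uptime < 60 )); then
                unset "pending[${server}]"
            fi
        done
        if (( ${#pending[@]} == 0 )); then
            return 0
        fi
        if (( SECONDS >= deadline )); then
            printf 'Timed out waiting for: %s\n' "${!pending[*]}" >&2
            return 1
        fi
        sleep "$poll"
    done
}

# Hypothetical usage: wait up to 30 minutes, polling every 30 seconds.
# wait_for_servers 1800 30 ${server_list} || exit 1
```

Returning a non-zero status on timeout lets the outer patch-group loop decide whether to stop or to carry on with the next group.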