How can I automatically kill idle GCE instances based on CPU usage?

Question

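I'm running some slightly unreliable software on instances in an instance group. The software is installed and run via a startup script. Most of the time it works fine, but roughly 10% of new instances run out of memory and crash because of a memory leak in the software. I can't fix the leak myself, so in the meantime I check the instances every few hours and kill any that show idle CPU (the software normally consumes all available CPU power).

However, I'm using preemptible instances, which can be shut down and restarted at any time, so whenever I'm not actively monitoring them, dead instances are left running. After a day unattended, I typically see about 80-85% CPU usage in the dashboard, with the rest wasted.

Is there an automated way to kill these dead instances? Restarting them is already handled by the instance group.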

Answer 1

This question seems to have two parts:

Identifying dead instances.
Killing those instances.

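As for identifying dead instances, one approach is to have a separate management instance that does not run the software and keeps watch over the other instances. For example, it could periodically send health check requests to the instances and mark as unhealthy any that are unresponsive or that report abnormal CPU usage (idle, in this case).

Once your management instance has identified unhealthy instances that need to be reset, you should be able to reset them through the API (I'm guessing the reset command) or do the same with the gcloud command-line tool.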
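The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: it presumes the management instance already has a per-instance CPU utilization figure from somewhere (how it is collected is not covered here), and the input format, the function names `select_idle` and `reset_instances`, and the `GCLOUD` override variable are all hypothetical. The only real command used is `gcloud compute instances reset`.

```shell
# Hypothetical input on stdin: one "instance-name cpu-utilization" pair per line,
# with utilization as a fraction between 0 and 1.
# Prints the names whose utilization is below the cutoff given as $1.
select_idle() {
  cutoff=$1
  awk -v c="$cutoff" '$2 + 0 < c + 0 { print $1 }'
}

# Resets every instance named on stdin in the zone given as $1.
# GCLOUD defaults to the real CLI; set GCLOUD=echo to print the
# commands instead of running them (a dry run).
reset_instances() {
  zone=$1
  while read -r name; do
    # </dev/null keeps the command from consuming the list of names.
    "${GCLOUD:-gcloud}" compute instances reset "$name" --zone "$zone" </dev/null
  done
}
```

For a dry run, something like `printf 'vm-a 0.98\nvm-b 0.01\n' | select_idle 0.05 | GCLOUD=echo reset_instances us-central1-a` prints the command that would be issued for `vm-b`. The 0.05 cutoff and the zone are placeholders.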

Answer 2

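The following worked for me. It is a bash script that uses the UNIX uptime command to check whether the 15-minute load average on the CPU is below a threshold, and automatically shuts the system down if that is the case for ten consecutive checks. You need to run it inside your VM instance.

Credit, and a more detailed explanation: Rohit Rawat's blog.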

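#!/bin/bash
# Power off the VM once the 15-minute load average has stayed below the
# threshold for more than ten consecutive one-minute checks.
threshold=0.4

count=0
while true
do

  # Strip everything up to "load average: " and keep the third value (15-minute average).
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # bc prints 1 if the comparison is true, 0 otherwise.
  res=$(echo "$load < $threshold" | bc -l)
  if (( res ))
  then
    echo "Idling.."
    ((count+=1))
  else
    # Not idle: start over so that only consecutive idle checks count.
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done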
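The parsing and the comparison in the script can be tried against a captured line of uptime output, without waiting for a machine to go idle. A minimal sketch, assuming the usual English "load average:" wording in the output; the function names are made up for illustration, and the comparison uses awk here rather than bc.

```shell
# Extracts the 15-minute load average from one line of uptime output on stdin.
load15_from_uptime() {
  sed -e 's/.*load average: //' | awk '{ print $3 }'
}

# Succeeds (exit status 0) when the first number is below the second.
is_below() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 < b + 0) }'
}
```

On a live system this could be used as `uptime | load15_from_uptime` followed by `is_below "$load" 0.4`.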

Answer 3

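Using viswajithiii's answer and this post, this works without bc (which is not in the GCP container OS):

It also appends the history list to a file before shutting down. I set the threshold very low, but the load showed as 0.00 even while I was editing files through the CLI. It may work better if the instance is under heavy load.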

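#!/bin/bash
# Same check without bc: the load average is compared as an integer number of hundredths.
threshold=10

count=0
while true
do

  # Third value after "load average: " is the 15-minute average.
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # Scale to hundredths and round so the shell's integer test can be used.
  load2=$(awk -v a="$load" 'BEGIN { printf "%.0f\n", a * 100 }')
  echo "$load2"
  if [ "$load2" -lt "$threshold" ]
  then
    echo "Idling.."
    ((count+=1))
  else
    # Not idle: start over so that only consecutive idle checks count.
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    history -a
    sudo poweroff
  fi

  sleep 60

done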
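The step that replaces bc is the conversion of the decimal load average into an integer, which the shell's `-lt` test can handle. A small sketch of just that step, with a hypothetical function name, rounding rather than truncating:

```shell
# Converts a decimal load average (e.g. 0.05) to a rounded integer
# number of hundredths (e.g. 5) so it can be used with [ ... -lt ... ].
to_hundredths() {
  awk -v a="$1" 'BEGIN { printf "%.0f\n", a * 100 }'
}
```

With this, a threshold of 10 corresponds to a load average of 0.10.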

That did not work for my low CPU usage, but this seems to:

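#!/bin/bash
# Threshold in tenths of a percent of CPU time.
threshold=1

count=0
while true
do

  # Sample the aggregate "cpu" line of /proc/stat twice, one second apart, and print
  # the user+system share of the elapsed user+system+idle time, scaled by 1000.
  load=$(awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 1000 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat))
  # Round to an integer for the shell comparison.
  load2=$(printf "%.0f\n" "$load")
  echo "$load"
  echo "$load2"
  if [[ $load2 -lt $threshold ]]
  then
    echo "Idling.."
    ((count+=1))
  else
    # Not idle: start over so that only consecutive idle checks count.
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    history -a
    sudo poweroff
  fi

  sleep 60

done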

For some reason, it only works with both echo lines for the load in place.
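The /proc/stat calculation can be separated from the sampling, which makes it possible to check the arithmetic on fixed input. A sketch under the same assumptions as the script (busy time is user plus system, total is user plus system plus idle); the function name is hypothetical, and the guard against a zero interval is an addition not present in the script above.

```shell
# Reads two aggregate "cpu ..." lines from /proc/stat on stdin, earlier sample
# first, and prints the busy share of the interval in tenths of a percent.
cpu_permille() {
  awk '{ u = $2 + $4; t = $2 + $4 + $5 }
       NR == 1 { u1 = u; t1 = t; next }
       { if (t > t1) printf "%.0f\n", (u - u1) * 1000 / (t - t1); else print 0 }'
}
```

On a Linux machine it could be fed with `{ grep 'cpu ' /proc/stat; sleep 1; grep 'cpu ' /proc/stat; } | cpu_permille`.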

Credits:

How to get overall CPU usage (e.g. 57%) on Linux
https://unix.stackexchange.com/questions/89712/how-to-convert-floating-point-number-to-integer

FYI: according to this, the GCP monitoring agent is not available for N-type instances.

Put it in a startup script under /etc/my_init.d and make it executable:

sudo mkdir /etc/my_init.d
sudo mv autooff.sh /etc/my_init.d/autooff.sh
sudo chmod 755 /etc/my_init.d/autooff.sh

Actually, that gets deleted. Instead, add custom metadata when editing the instance: the key startup-script with the value #! /bin/bash \n~./autooff.sh
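The same metadata key can also be set from the command line with `gcloud compute instances add-metadata`. The sketch below is a variation, not what the answer describes: it uploads the script file itself as the startup script instead of a one-line wrapper that calls it. The function name and the `GCLOUD` override are hypothetical, and the instance name, zone, and file path are whatever applies to your setup.

```shell
# Attaches a local script file as the startup-script metadata value of an instance.
# Arguments: instance name, zone, path to the script file.
# GCLOUD defaults to the real CLI; set GCLOUD=echo to print the command instead.
attach_startup_script() {
  instance=$1
  zone=$2
  script=$3
  "${GCLOUD:-gcloud}" compute instances add-metadata "$instance" \
    --zone "$zone" \
    --metadata-from-file "startup-script=$script"
}
```

Running `GCLOUD=echo attach_startup_script my-vm us-central1-a autooff.sh` shows the command without contacting GCP; `my-vm` and `us-central1-a` are placeholders.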

Answer 4

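I wish I could add this as a comment on viswajithiii's answer, but I am just short of being able to comment.

I found that a static threshold variable was not suitable when I used cloud VMs with varying numbers of CPUs, because the output of uptime scales with the number of CPUs, as discussed here.

My updated script adds two lines below the threshold assignment to scale threshold by the number of CPUs. This lets me set a percentage CPU utilization that applies to VMs with different numbers of CPUs.

Apart from that, the script is the same as viswajithiii's.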

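#!/bin/bash

# Per-CPU threshold; the two lines below scale it by the number of CPUs.
threshold=0.4
n_cpu=$( grep 'model name' /proc/cpuinfo | wc -l )
threshold=$( echo "$n_cpu*$threshold" | bc )

count=0
while true
do

  # Third value after "load average: " is the 15-minute average.
  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  # bc prints 1 if the comparison is true, 0 otherwise.
  res=$(echo "$load < $threshold" | bc -l)
  if (( res ))
  then
    echo "Idling.."
    ((count+=1))
  else
    # Not idle: start over so that only consecutive idle checks count.
    count=0
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done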
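To see what the scaling does for different machine sizes, the multiplication can be isolated in a small function. This is a sketch with a hypothetical name, using awk instead of bc so that it also runs where bc is not installed:

```shell
# Multiplies a per-CPU load threshold ($1) by a CPU count ($2)
# and prints the result with two decimal places.
scaled_threshold() {
  awk -v per_cpu="$1" -v n="$2" 'BEGIN { printf "%.2f\n", per_cpu * n }'
}
```

A per-CPU threshold of 0.4 gives 0.40 on a single-CPU VM and 1.60 on a four-CPU VM. In the script, the count could come from the same grep on /proc/cpuinfo, for example `scaled_threshold 0.4 "$(grep -c 'model name' /proc/cpuinfo)"`.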


google-compute-engine