Showing only data from machines which do not show up in another query
I have a fairly typical query to show CPU usage:
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80
This returns data that looks a bit like this:
{instance="opus143.domain.com:9182"} 94.07140535559513
{instance="opus162.domain.com:9182"} 90.00755315803018
{instance="opus163.domain.com:9182"} 85.48084077380952
But I only want values for machines that do NOT appear in another list:
opus_local_slaves_count > 0
opus_local_slaves_count{instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
I think I've managed to use label_replace to give me the host in each case:
(label_replace((100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80), "host", "$1", "instance", "(.*?)[.].*"))
{host="opus143",instance="opus143.domain.com:9182"} 94.07140535559513
{host="opus162",instance="opus162.domain.com:9182"} 90.00755315803018
{host="opus163",instance="opus163.domain.com:9182"} 85.48084077380952
label_replace((opus_local_slaves_count > 0), "host", "$1", "instance", "(.*?)[.].*")
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
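But now I'm really stuck trying to exclude the hosts in the second list from the first. Is this even possible in PromQL? In SQL it would be a simple NOT IN subquery.
Update: for context, what I'm trying to achieve is to alert on high CPU on servers, except for the servers in the second list, which are expected to have high CPU utilisation. Perhaps there is a better way?
Solved!
For anyone who finds this question and wants to do something similar... the key word is UNLESS!
I first simplified things by creating recording rules: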
groups:
  - name: custom_rules
    rules:
      # Per-instance CPU utilisation in percent
      - record: wmi_cpu_time_total_instance
        expr: 100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
      # Same series with a "host" label taken from the part of "instance" before the first dot
      - record: wmi_cpu_time_total_instance_host
        expr: label_replace(wmi_cpu_time_total_instance, "host", "$1", "instance", "(.*?)[.].*")
      - record: opus_local_slaves_count_instance_host
        expr: label_replace(opus_local_slaves_count, "host", "$1", "instance", "(.*?)[.].*")
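These rules encapsulate most of the complexity of the calculation and of adding the host label. I then found this blog post (thanks, Chris Siebenmann): https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusFindUnpairedMetrics It pointed me to the UNLESS keyword, which let me write a simple query: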
wmi_cpu_time_total_instance_host unless on(host) (opus_local_slaves_count_instance_host > 0)
which gives the list of hosts that either have no opus_local_slaves_count series at all or have opus_local_slaves_count = 0.
Voilà!
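For the alerting use case mentioned in the update, the query above could be wrapped in an alerting rule roughly as follows. This is only a sketch: the group name, alert name, `for` duration, severity label and annotation text are all assumptions of mine, not part of the original setup, and the `> 80` threshold is carried over from the first query in the question because the recording rules themselves do not apply it.

```yaml
groups:
  - name: custom_alerts                 # hypothetical group name
    rules:
      - alert: HighCpuOnNonOpusHost     # hypothetical alert name
        # Left side: hosts above 80% CPU.
        # "unless on(host)" drops every left-hand series whose host label
        # also appears on the right side (hosts with running slaves).
        expr: |
          (wmi_cpu_time_total_instance_host > 80)
            unless on(host)
          (opus_local_slaves_count_instance_host > 0)
        for: 10m                        # assumed duration; tune to taste
        labels:
          severity: warning             # assumed label
        annotations:
          summary: "High CPU on {{ $labels.host }}"
```

Because `unless` is a set operator, it does not matter that the right-hand side has several series per host (one per port); a single match on `host` is enough to exclude that host from the result.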