Showing only data from machines which do not show up in another query

I have a fairly typical query for showing CPU usage:

100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80

This results in data that looks a bit like this:

{instance="opus143.domain.com:9182"} 94.07140535559513 
{instance="opus162.domain.com:9182"} 90.00755315803018 
{instance="opus163.domain.com:9182"} 85.48084077380952 

But I only want the values for machines that do not appear in another list:

opus_local_slaves_count > 0

opus_local_slaves_count{instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 

I think I have managed to use label_replace to give me the host in each case:

(label_replace((100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80), "host", "$1", "instance", "(.*?)[.].*"))

{host="opus143",instance="opus143.domain.com:9182"} 94.07140535559513 
{host="opus162",instance="opus162.domain.com:9182"} 90.00755315803018 
{host="opus163",instance="opus163.domain.com:9182"} 85.48084077380952 

label_replace((opus_local_slaves_count > 0), "host", "$1", "instance", "(.*?)[.].*")

opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 

But now I am really stuck trying to exclude the hosts in the second list from the first list. Is this even possible in PromQL? In SQL this would be a simple NOT IN subquery.

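Update: for context, what I am trying to achieve is to alert on high CPU on servers, except for the servers in the second list, which are expected to have high CPU utilisation. Maybe there is a better way?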

Solved!

For anyone who finds this question and wants to do something similar... the key keyword is UNLESS!

I started by simplifying things with some recording rules:

groups:
- name: custom_rules
  rules:
  - record: wmi_cpu_time_total_instance
    expr: 100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
  - record: wmi_cpu_time_total_instance_host
    expr: label_replace(wmi_cpu_time_total_instance, "host", "$1", "instance", "(.*?)[.].*")
  - record: opus_local_slaves_count_instance_host
    expr: label_replace(opus_local_slaves_count, "host", "$1", "instance", "(.*?)[.].*")

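These encapsulate most of the complexity of the calculation and of adding the host label. Then I found this blog post (thank you Chris Siebenmann) https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusFindUnpairedMetrics which pointed me to the UNLESS keyword, so I could write the simple query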

wmi_cpu_time_total_instance_host unless on(host) (opus_local_slaves_count_instance_host > 0)

which gives the CPU usage for the hosts that either have no opus_local_slaves_count series or have opus_local_slaves_count = 0.
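For ad hoc use in the expression browser, the same idea should also work as a single expression without the recording rules. This is only a sketch built from the queries in the question: it assumes "$1" as the label_replace replacement, so that the captured host name is written into the host label, and it keeps the original > 80 threshold on the CPU side.

```promql
# CPU usage above 80%, with a "host" label taken from the part of
# "instance" before the first dot ...
label_replace(
  100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80,
  "host", "$1", "instance", "(.*?)[.].*"
)
# ... minus every host that currently reports at least one local slave.
unless on(host)
label_replace(
  opus_local_slaves_count > 0,
  "host", "$1", "instance", "(.*?)[.].*"
)
```

Because unless is a set operator, it does not matter that the right-hand side has several series per host (ports 5100 and 5110): a left-hand series is dropped as soon as any right-hand series has the same host value.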

Voilà!
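For the alerting use case mentioned in the update, a rule along these lines might be a starting point. It is a minimal sketch, not something I have run in production: the group name, the alert name HighCpuOnNonOpusHost, the 10m "for" duration and the severity label are all made up for illustration, and it assumes the recording rules above are already loaded. Only the 80% threshold comes from the original query.

```yaml
groups:
- name: custom_alerts            # hypothetical group name
  rules:
  - alert: HighCpuOnNonOpusHost  # hypothetical alert name
    # High CPU on any host, except hosts that are running local slaves
    # and are therefore expected to be busy.
    expr: (wmi_cpu_time_total_instance_host > 80) unless on(host) (opus_local_slaves_count_instance_host > 0)
    for: 10m                     # assumed duration, tune to taste
    labels:
      severity: warning          # assumed label
    annotations:
      summary: "High CPU usage on {{ $labels.host }}"
      description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
```

The threshold is applied on the left-hand side before the unless, so the alert only ever carries the CPU series and its labels, never the opus_local_slaves_count ones.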