What is meant by symptom-based monitoring and cause-based monitoring?

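In the SRE context, what is meant by symptom-based and cause-based monitoring? Why is it so important? Which tools are used for this kind of monitoring?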

Symptoms Versus Causes


Your monitoring system should address two questions: what’s broken, and why?

The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. The table below lists some hypothetical symptoms and corresponding causes.

"What" versus "why" is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.

Examples

+--------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
|                        Symptom                         |                                                      Cause                                                      |
+--------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
| I’m serving HTTP 500s or 404s                          | Database servers are refusing connections                                                                       |
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| My responses are slow                                  | CPUs are overloaded by a bogosort, or an Ethernet cable is crimped under a rack, visible as partial packet loss |
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Users in Antarctica aren’t receiving animated cat GIFs | Your Content Distribution Network hates scientists and felines, and thus blacklisted some client IPs            |
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Private content is world-readable                      | A new software push caused ACLs to be forgotten and allowed all requests                                        |
+--------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+

Source
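
To make the distinction concrete, here is a minimal Python sketch of the first two table rows. It is not taken from the quoted source: the `Snapshot` fields, the 5% error threshold, and the 90% CPU threshold are all hypothetical values chosen for illustration. The sketch assumes one common way of applying the idea, which is to page on the symptom and attach the cause signals as debugging context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical counters collected over one evaluation window."""
    total_requests: int
    http_5xx: int
    db_connections_refused: int
    cpu_utilization: float  # fraction between 0.0 and 1.0


def symptom_error_ratio(s: Snapshot, threshold: float = 0.05) -> bool:
    """Symptom ("what is broken"): users are receiving server errors."""
    if s.total_requests == 0:
        return False
    return s.http_5xx / s.total_requests > threshold


def possible_causes(s: Snapshot) -> List[str]:
    """Causes ("why"): internal signals that may explain a symptom."""
    causes = []
    if s.db_connections_refused > 0:
        causes.append("database servers are refusing connections")
    if s.cpu_utilization > 0.9:  # assumed threshold for "overloaded"
        causes.append("CPUs are overloaded")
    return causes


def evaluate(s: Snapshot) -> dict:
    """Decide whether to page from the symptom alone; report causes as context."""
    return {"page": symptom_error_ratio(s), "causes": possible_causes(s)}


if __name__ == "__main__":
    outage = Snapshot(total_requests=1000, http_5xx=120,
                      db_connections_refused=37, cpu_utilization=0.4)
    print(evaluate(outage))
```

In this sketch a busy CPU with no user-visible errors does not page anyone; it only shows up in the `causes` list, which is one way to keep alert noise low while still recording the "why".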

The tools used for monitoring depend on your platform and on what you want to monitor and how. For example, Azure Monitor is for applications and infrastructure hosted in Azure, Amazon CloudWatch is for those hosted in AWS, and so on.