When should I use a gauge or a histogram in Prometheus to record request duration?

Question

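I'm new to metrics monitoring.

If we want to record the duration of requests, I would have thought a gauge is the right type, but in practice people use a histogram.

For example, grpc-ecosystem/go-grpc-prometheus prefers a histogram for recording durations. Is there an agreed best practice for choosing a metric type, or is this just their own preference?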

// ServerMetrics represents a collection of metrics to be registered on a
// Prometheus metrics registry for a gRPC server.
type ServerMetrics struct {
    serverStartedCounter          *prom.CounterVec
    serverHandledCounter          *prom.CounterVec
    serverStreamMsgReceived       *prom.CounterVec
    serverStreamMsgSent           *prom.CounterVec
    serverHandledHistogramEnabled bool
    serverHandledHistogramOpts    prom.HistogramOpts
    serverHandledHistogram        *prom.HistogramVec
}

Thanks!

Answer 1

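I'm fairly new to this myself, but let me try to answer your question. Please take my answer with a grain of salt; someone with more experience in using metrics to observe their systems may be able to add to it.

As stated at https://prometheus.io/docs/concepts/metric_types/: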

"A gauge is a metric that represents a single numerical value that can arbitrarily go up and down."

So if your goal is only to display the current value (the duration of the most recent request), you could use a gauge. But I think the goal of using metrics is to find problems within your system, to generate alerts when certain values fall outside a predefined range, or to derive a performance value (like the Apdex score) for your system. A gauge that is overwritten on every request only shows whatever value happens to be there at scrape time, so it cannot answer those questions.
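To make the difference concrete, here is a minimal sketch in plain Go (standard library only). It is not the Prometheus client API: lastValueGauge, bucketHistogram and record are made-up stand-ins, and the bucket bounds are an arbitrary assumption. It only illustrates that a gauge forgets everything but the last value, while cumulative bucket counts still show that one request was slow.

```go
package main

import "fmt"

// lastValueGauge is a hypothetical stand-in for a gauge:
// Set overwrites whatever was stored before.
type lastValueGauge struct{ value float64 }

func (g *lastValueGauge) Set(v float64) { g.value = v }

// bucketHistogram is a hypothetical stand-in for a Prometheus-style
// histogram: each bucket counts observations <= its upper bound.
type bucketHistogram struct {
	upperBounds []float64 // sorted ascending, in seconds
	counts      []uint64  // counts[i] = observations <= upperBounds[i]
	total       uint64    // all observations (the implicit +Inf bucket)
	sum         float64   // sum of all observed values
}

func newBucketHistogram(bounds []float64) *bucketHistogram {
	return &bucketHistogram{upperBounds: bounds, counts: make([]uint64, len(bounds))}
}

func (h *bucketHistogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.total++
	h.sum += v
}

// record feeds the same request durations into both metric types.
func record(durations []float64) (*lastValueGauge, *bucketHistogram) {
	g := &lastValueGauge{}
	h := newBucketHistogram([]float64{0.1, 0.5, 1, 5}) // assumed bounds
	for _, d := range durations {
		g.Set(d)
		h.Observe(d)
	}
	return g, h
}

func main() {
	// Four requests between two scrapes; the third one is slow.
	g, h := record([]float64{0.05, 0.08, 2.5, 0.07})
	fmt.Println("gauge at scrape time:", g.value)      // 0.07
	fmt.Println("cumulative bucket counts:", h.counts) // [3 3 3 4]
	fmt.Println("total observations:", h.total)        // 4
}
```

The gauge reports 0.07 and the 2.5 s request is gone, whereas the histogram still shows that one of four requests took longer than 1 s.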

From https://prometheus.io/docs/concepts/metric_types/#histogram:

Use the histogram_quantile() function to calculate quantiles from histograms or even aggregations of histograms. A histogram is also suitable to calculate an Apdex score.

From https://en.wikipedia.org/wiki/Apdex:

Apdex (Application Performance Index) is an open standard developed by an alliance of companies for measuring performance of software applications in computing. Its purpose is to convert measurements into insights about user satisfaction, by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations.
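The Apdex score is (satisfied + tolerating / 2) / total, where "satisfied" requests finish within a target time T and "tolerating" requests finish within 4T. With cumulative histogram buckets this becomes (count(le=T) + count(le=4T)) / 2 / total, because the le=4T bucket already includes the le=T requests. Below is a small sketch in plain Go; the function name apdexFromBuckets, the thresholds (0.3 s and 1.2 s) and the counts are assumptions for illustration, and it requires bucket bounds that match T and 4T.

```go
package main

import "fmt"

// apdexFromBuckets computes an Apdex score from cumulative bucket counts.
//   leTarget:    observations <= T  (satisfied)
//   leTolerated: observations <= 4T (satisfied + tolerating)
//   total:       all observations
// ok is false when there are no observations, since the score is undefined.
func apdexFromBuckets(leTarget, leTolerated, total float64) (score float64, ok bool) {
	if total <= 0 {
		return 0, false
	}
	// leTolerated already contains leTarget, so adding both and halving
	// weights satisfied requests as 1 and tolerating requests as 0.5.
	return (leTarget + leTolerated) / 2 / total, true
}

func main() {
	// Assumed sample: T = 0.3s, 4T = 1.2s.
	// 70 requests <= 0.3s, 90 requests <= 1.2s, 100 requests in total.
	score, ok := apdexFromBuckets(70, 90, 100)
	fmt.Println(score, ok) // 0.8 true
}
```

A gauge holding a single duration cannot produce this number, because it does not know how many requests fell under each threshold.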

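Read up on quantiles and how they are calculated for histograms and summaries at https://prometheus.io/docs/practices/histograms/#quantiles

That page gives two rules of thumb:

If you need to aggregate, choose histograms.
Otherwise, choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.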
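The aggregation rule is easier to see with numbers. Bucket counts from several instances can simply be added (as long as the bucket bounds match), and a quantile taken from the sum reflects all requests. Quantiles that were precomputed per instance cannot be combined that way. Here is a rough sketch in plain Go; the histogram type, merge, quantileUpperBound and the sample data are all made up for illustration, and the quantile is reported only as a bucket upper bound, without interpolation.

```go
package main

import "fmt"

// histogram holds cumulative bucket counts for one instance (hypothetical
// type). All instances must share the same upperBounds.
type histogram struct {
	upperBounds []float64
	cumCounts   []uint64 // cumCounts[i] = observations <= upperBounds[i]
	total       uint64
}

// merge adds up bucket counts across instances. It assumes at least one
// histogram is passed and that all of them use identical bounds.
func merge(hs ...histogram) histogram {
	out := histogram{
		upperBounds: hs[0].upperBounds,
		cumCounts:   make([]uint64, len(hs[0].cumCounts)),
	}
	for _, h := range hs {
		for i, c := range h.cumCounts {
			out.cumCounts[i] += c
		}
		out.total += h.total
	}
	return out
}

// quantileUpperBound returns the smallest bucket bound that covers the
// fraction q of all observations. ok is false if no finite bucket does.
func quantileUpperBound(q float64, h histogram) (bound float64, ok bool) {
	rank := q * float64(h.total)
	for i, c := range h.cumCounts {
		if float64(c) >= rank {
			return h.upperBounds[i], true
		}
	}
	return 0, false
}

func main() {
	bounds := []float64{0.1, 0.5, 2.5}
	// A busy instance serving 1000 fast requests ...
	busy := histogram{upperBounds: bounds, cumCounts: []uint64{1000, 1000, 1000}, total: 1000}
	// ... and a nearly idle instance serving 10 slow ones.
	idle := histogram{upperBounds: bounds, cumCounts: []uint64{0, 0, 10}, total: 10}

	p95All, _ := quantileUpperBound(0.95, merge(busy, idle))
	fmt.Println("p95 from merged buckets: <=", p95All) // <= 0.1

	p95Busy, _ := quantileUpperBound(0.95, busy)
	p95Idle, _ := quantileUpperBound(0.95, idle)
	fmt.Println("average of per-instance p95:", (p95Busy+p95Idle)/2) // 1.3
}
```

Over 95% of all requests finished within 0.1 s, yet averaging the two per-instance values suggests 1.3 s. That is the situation you end up in with summaries, which expose precomputed quantiles per instance.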

Or, as Adam Woodbeck puts it in his book "Network Programming with Go":

The general advice is to use summaries when you don’t know the range of expected values, but I’d advise you to use histograms whenever possible so that you can aggregate histograms on the metrics server.

