Calculating average and percentiles from a histogram map?
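I wrote a timer that measures the performance of a particular piece of code in any multithreaded application. The timer below also populates a map with the number of calls that took x milliseconds. I will use this map as a histogram for further analysis, such as what percentage of calls took a given number of milliseconds.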
public static class StopWatch {
public static ConcurrentHashMap<Long, Long> histogram = new ConcurrentHashMap<Long, Long>();
/**
* Creates an instance of the timer and starts it running.
*/
public static StopWatch getInstance() {
return new StopWatch();
}
private long m_end = -1;
private long m_interval = -1;
private final long m_start;
private StopWatch() {
m_start = m_interval = currentTime();
}
/**
* Returns in milliseconds the amount of time that has elapsed since the timer was created. If the
* <code>stop</code> method has been invoked, then this returns instead the elapsed time between the creation of
* the timer and the moment when <code>stop</code> was invoked.
*
* @return duration it took
*/
public long getDuration() {
long result = 0;
final long startTime = m_start;
final long endTime = isStopWatchRunning() ? currentTime() : m_end;
result = convertNanoToMilliseconds(endTime - startTime);
boolean done = false;
while (!done) {
Long oldValue = histogram.putIfAbsent(result, 1L);
if (oldValue != null) {
done = histogram.replace(result, oldValue, oldValue + 1);
} else {
done = true;
}
}
return result;
}
/**
* Returns in milliseconds the amount of time that has elapsed since the last invocation of this same method. If
* this method has not previously been invoked, then it is the amount of time that has elapsed since the timer
* was created. <strong>Note</strong> that once the <code>stop</code> method has been invoked this will just
* return zero.
*
* @return interval period
*/
public long getInterval() {
long result = 0;
final long startTime = m_interval;
final long endTime;
if (isStopWatchRunning()) {
endTime = m_interval = currentTime();
} else {
endTime = m_end;
}
result = convertNanoToMilliseconds(endTime - startTime);
return result;
}
/**
* Stops the timer from advancing. This has an impact on the values returned by both the
* <code>getDuration</code> and the <code>getInterval</code> methods.
*/
public void stop() {
if (isStopWatchRunning()) {
m_end = currentTime();
}
}
/**
* What is the current time in nanoseconds?
*
* @return returns back the current time in nanoseconds
*/
private long currentTime() {
return System.nanoTime();
}
/**
* This is used to check whether the timer is alive or not
*
* @return checks whether the timer is running or not
*/
private boolean isStopWatchRunning() {
return (m_end <= 0);
}
/**
* This is used to convert NanoSeconds to Milliseconds
*
* @param nanoseconds
* @return milliseconds value of nanoseconds
*/
private long convertNanoToMilliseconds(final long nanoseconds) {
return nanoseconds / 1000000L;
}
}
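For example, this is how I would use the timer class above to measure the performance of a particular piece of code in my multithreaded application: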
StopWatch timer = StopWatch.getInstance();
//... some code here to measure
timer.getDuration();
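Now my question is: what is the best way to calculate the average, median, 95th and 99th percentile of the requests from the histogram map? I would like to add methods to my StopWatch class that do all of these calculations, so that I can then use the StopWatch instance directly.
My histogram map will look like this:
key - the number of milliseconds
value - the number of calls that took that many milliseconds.
Given a histogram (frequency list) like the following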
Value | Frequency
------+----------
1 | 5
2 | 3
3 | 1
4 | 7
5 | 2
..
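each Value occurs Frequency times in your data set.
// Requires java.util.Map and java.util.concurrent.ConcurrentHashMap.
public static double getMean(ConcurrentHashMap<Long, Long> histogram)
{
    double a = 0; // running SUM(value x frequency)
    double b = 0; // running SUM(frequency)
    // Iterate over the entries so each value is read together with its frequency.
    // (keySet() returns a Set, so it cannot be assigned to a TreeSet; sorting is not needed for the mean.)
    for (Map.Entry<Long, Long> entry : histogram.entrySet())
    {
        a = a + (entry.getKey() * entry.getValue());
        b = b + entry.getValue();
    }
    // mean = SUM(value x frequency) / SUM(frequency); this is NaN for an empty histogram
    return a / b;
}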
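As a worked check of the formula mean = SUM(value x frequency) / SUM(frequency), assuming the histogram contains only the five rows listed (the trailing `..` in the table suggests there may be more), the arithmetic would be:

```latex
\bar{x}
  = \frac{\sum_i v_i f_i}{\sum_i f_i}
  = \frac{1\cdot 5 + 2\cdot 3 + 3\cdot 1 + 4\cdot 7 + 5\cdot 2}{5 + 3 + 1 + 7 + 2}
  = \frac{52}{18}
  \approx 2.89
```

Here $v_i$ is a Value and $f_i$ its Frequency; both symbols are introduced only for this illustration.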
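The mean is straightforward to implement. The median is the 50th percentile, so you only need a single working percentile method plus a utility method for the median. There are several variations of percentile calculation, but this one should produce the same results as the Microsoft Excel PERCENTILE.INC function.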
import java.util.Map;
import java.util.SortedSet;
import java.util.concurrent.ConcurrentSkipListSet;
public class HistogramStatistics
{
public static Double average(final Map<Long, Long> histogram)
{
return HistogramStatistics.mean(histogram);
}
public static Double mean(final Map<Long, Long> histogram)
{
double sum = 0L;
for (Long value : histogram.keySet())
{
sum += (value * histogram.get(value));
}
return sum / (double) HistogramStatistics.count(histogram);
}
public static Double median(final Map<Long, Long> histogram)
{
return HistogramStatistics.percentile(histogram, 0.50d);
}
public static Double percentile(final Map<Long, Long> histogram, final double percent)
{
if ((percent < 0d) || (percent > 1d))
{
throw new IllegalArgumentException("Percentile must be between 0.00 and 1.00.");
}
if ((histogram == null) || histogram.isEmpty())
{
return null;
}
double n = (percent * (HistogramStatistics.count(histogram).doubleValue() - 1d)) + 1d;
double d = n - Math.floor(n);
SortedSet<Long> bins = new ConcurrentSkipListSet<Long>(histogram.keySet());
long observationsBelowBinInclusive = 0L;
Long lowBin = bins.first();
Double valuePercentile = null;
for (Long highBin : bins)
{
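final long observationsBelowBin = observationsBelowBinInclusive;
observationsBelowBinInclusive += histogram.get(highBin);
if (n <= observationsBelowBinInclusive)
{
// Interpolate across bins only when rank floor(n) is the last observation of the previous bin;
// otherwise both neighbouring ranks fall inside highBin.
if (Math.floor(n) > observationsBelowBin)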
{
lowBin = highBin;
}
valuePercentile = lowBin.doubleValue() + ((highBin - lowBin) * d);
break;
}
lowBin = highBin;
}
return valuePercentile;
}
public static Long count(final Map<Long, Long> histogram)
{
long observations = 0L;
for (Long value : histogram.keySet())
{
observations += histogram.get(value);
}
return observations;
}
}
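The rank arithmetic that the PERCENTILE.INC convention relies on can be written out as below. The symbols are chosen here for illustration: $N$ is the total number of observations, $p$ the requested fraction between 0 and 1, and $x_{(k)}$ the $k$-th smallest observation (1-based).

```latex
n = p\,(N - 1) + 1, \qquad d = n - \lfloor n \rfloor, \qquad
P_p = x_{(\lfloor n \rfloor)} + d\,\bigl(x_{(\lceil n \rceil)} - x_{(\lfloor n \rfloor)}\bigr)
```

As a sketch, assuming the sample histogram holds only the five rows shown (so $N = 18$): for the median, $n = 0.5 \cdot 17 + 1 = 9.5$, the 9th observation is 3 and the 10th is 4, which gives $3 + 0.5\,(4 - 3) = 3.5$. For the 95th percentile, $n = 0.95 \cdot 17 + 1 = 17.15$, and the 17th and 18th observations are both 5, so the result is 5.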
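You may want to round the measured durations to some desired resolution, for example in units of 10 or 100 milliseconds, so that your map does not become bloated with every possible latency value.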
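A minimal sketch of that bucketing idea follows. The class `LatencyBuckets` and the method `toBucket` are hypothetical names introduced for illustration, and the sketch rounds down to the start of each bucket, which is only one of several reasonable rounding choices.

```java
/** Hypothetical helper (not part of the original StopWatch): maps a duration to its bucket. */
final class LatencyBuckets {
    private LatencyBuckets() {
    }

    /**
     * Rounds a non-negative duration down to the lower bound of its bucket.
     * With a resolution of 10, durations 0..9 map to 0, 10..19 map to 10, and so on.
     */
    static long toBucket(long durationMillis, long resolutionMillis) {
        if (resolutionMillis <= 0) {
            throw new IllegalArgumentException("resolutionMillis must be positive");
        }
        if (durationMillis < 0) {
            throw new IllegalArgumentException("durationMillis must not be negative");
        }
        // Integer division discards the remainder, so this is a floor for non-negative input.
        return (durationMillis / resolutionMillis) * resolutionMillis;
    }
}
```

The bucket value would then be used as the histogram key in place of the raw millisecond count. The trade-off is that any percentile computed from the map is only accurate to within one bucket width.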
You could also use an array instead of a map, which gives O(1) lookups even in the worst case and benefits from memory locality.
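One way the array variant might look is sketched below, using `java.util.concurrent.atomic.AtomicLongArray` so that concurrent increments stay safe. The class name `ArrayHistogram`, the fixed upper bound `maxMillis`, and the decision to fold out-of-range values into the first and last slots are all assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical fixed-range histogram: slot i counts calls that took i milliseconds. */
final class ArrayHistogram {
    private final AtomicLongArray counts;

    ArrayHistogram(int maxMillis) {
        if (maxMillis < 0 || maxMillis == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("maxMillis out of range");
        }
        // One slot per value 0..maxMillis; the last slot also absorbs larger durations.
        this.counts = new AtomicLongArray(maxMillis + 1);
    }

    /** Records one call; values outside 0..maxMillis are clamped into the range. */
    void record(long durationMillis) {
        long last = counts.length() - 1;
        int index = (int) Math.max(0L, Math.min(durationMillis, last));
        counts.incrementAndGet(index);
    }

    /** Returns how many calls were recorded in the given slot. */
    long countAt(int millis) {
        return counts.get(millis);
    }

    /** Returns the total number of recorded calls. */
    long total() {
        long sum = 0L;
        for (int i = 0; i < counts.length(); i++) {
            sum += counts.get(i);
        }
        return sum;
    }
}
```

Because the slots are already ordered by value, a percentile scan can walk the array from index 0 without sorting any keys first. The cost is that memory is allocated for the whole range up front, whether or not each value ever occurs.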
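In addition, instead of the while (!done) loop in getDuration(), you could use a LongAdder or an AtomicLong, which should be faster.
As for reliably computing percentiles from a binned histogram, you can take a look at HBPE for a reference implementation. Disclaimer: I am the author.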
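A rough sketch of the LongAdder variant is shown here, assuming Java 8 or later. The class `AdderHistogram` and its method names are hypothetical; only `ConcurrentHashMap.computeIfAbsent`, `LongAdder.increment` and `LongAdder.sum` are standard library calls.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical histogram that counts with LongAdder instead of a putIfAbsent/replace retry loop. */
final class AdderHistogram {
    private final ConcurrentHashMap<Long, LongAdder> counts = new ConcurrentHashMap<>();

    /** Records one call that took the given number of milliseconds. */
    void record(long durationMillis) {
        // computeIfAbsent creates the adder atomically on first use; increment() never needs a retry.
        counts.computeIfAbsent(durationMillis, key -> new LongAdder()).increment();
    }

    /** Returns the number of calls recorded for the given duration, or 0 if there were none. */
    long countAt(long durationMillis) {
        LongAdder adder = counts.get(durationMillis);
        return adder == null ? 0L : adder.sum();
    }
}
```

Inside `getDuration()`, the whole retry loop would then reduce to a single `record(result)` call. Note that `sum()` is not an atomic snapshot while other threads are still recording, so statistics are best read after the measured work has finished.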