How to: Exclude a row from stddev/mean calculations and join later
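So I'm trying to find machines that throw an unusually high number of exceptions compared with their group, where groups are partitioned by environment and function. The intuition is that the load and the type of tasks should be very similar across a group, so if one machine throws many more exceptions than its peers, it is probably in some bad state and should be serviced.
This works well for large groups of machines, but there is a problem with smaller groups: if there are only a few machines and just one of them throws a lot of exceptions, it might not be detected. The reason is that this data point is itself part of the group's overall stddev and mean calculation, so the mean and stddev are skewed toward the outlier.
The solution would be either to somehow subtract that data point from the stddev and mean computed for the whole group, or to compute the stddev and mean per machine/environment/function combination (with that machine excluded from the stddev/mean calculation) rather than only per environment/function group.
Here is the current code that does this per environment/function. Is there an elegant way to extend it to machine/environment/function?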
// Find sick machines
let SickMachinesAt = (AtTime:datetime, TimeWindow:timespan = 1h, Sigmas:double = 3.0, MinimumExceptionsToTrigger:int = 10) {
    // These are the exceptions we are looking at (time window constrained)
    let Exceptions = exception
        | where EventInfo_Time between ((AtTime - TimeWindow) .. AtTime);
    // Calculate mean and stddev for each bin of environmentName + machineFunction
    let MeanAndStdDev = Exceptions
        | summarize count() by environmentName, machineFunction, machineName
        | summarize avg(count_), stdev(count_) by environmentName, machineFunction
        | order by environmentName, machineFunction;
    let MachinesWithMeanAndStdDev = Exceptions
        | summarize count() by environmentName, machineFunction, machineName
        | join kind=fullouter MeanAndStdDev on environmentName, machineFunction;
    let SickMachines = MachinesWithMeanAndStdDev
        | project machineName,
                  machineFunction,
                  environmentName,
                  totalExceptionCount = count_,
                  cutoff = avg_count_ + Sigmas * stdev_count_,
                  signalStrength = ((count_ - avg_count_) / stdev_count_)
        | where totalExceptionCount > cutoff and totalExceptionCount > MinimumExceptionsToTrigger
        | order by signalStrength desc;
    SickMachines
}
One option for avoiding missed detections caused by a strong outlier is to use percentile-based detection. For that, you can use make-series followed by the built-in series_outliers function.
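As a rough sketch of the percentile-based idea, the query below scores each machine against the other machines in its group with series_outliers. Note that it builds the per-group array with make_list rather than make-series, because here the comparison is across machines and not across time bins; that substitution, the sample rows in the datatable, and the 1.5 score threshold are assumptions made for illustration, not part of the original query.

```kusto
// Hypothetical per-machine counts, standing in for:
// Exceptions | summarize counts = count() by environmentName, machineFunction, machineName
let ExceptionsCounts = datatable(environmentName:string, machineFunction:string, machineName:string, counts:long)
[
    "prod", "web", "m1", 12,
    "prod", "web", "m2", 9,
    "prod", "web", "m3", 11,
    "prod", "web", "m4", 10,
    "prod", "web", "m5", 13,
    "prod", "web", "m6", 250
];
ExceptionsCounts
// Collect each group's machines and counts into two parallel arrays
| summarize machines = make_list(machineName), countSeries = make_list(counts)
    by environmentName, machineFunction
// Percentile-based (Tukey-style) anomaly score for every element of the array
| extend scores = series_outliers(countSeries)
// Expand the arrays back into one row per machine
| mv-expand machineName = machines to typeof(string),
            counts = countSeries to typeof(long),
            score = scores to typeof(double)
// Assumed threshold: treat scores above 1.5 as unusually high
| where score > 1.5
| project environmentName, machineFunction, machineName, counts, score
| order by score desc
```

Because the score is derived from percentiles of the group rather than from its mean and stddev, a single extreme machine has much less influence on the baseline it is compared against. The threshold and the percentile arguments of series_outliers would likely need tuning for very small groups.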
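Another option is to remove the outliers from the calculation and then rejoin the data, which requires several joins. Assuming your exceptions are in Exceptions with the dimensions environmentName, machineFunction, and machineName, you can use the following pseudo-query to remove all machines whose counts are at or above the 98th percentile of their group: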
// Per-machine exception counts within each environment/function group
let ExceptionsCounts = Exceptions
    | summarize counts = count() by environmentName, machineFunction, machineName;
// Keep only the machines whose count is below the group's 98th percentile
let ExceptionsCleansed = ExceptionsCounts
    | summarize p98 = percentile(counts, 98) by environmentName, machineFunction
    | join kind=inner (ExceptionsCounts) on environmentName, machineFunction
    | where counts < p98;
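From there you can use ExceptionsCleansed to calculate the mean/stddev, and then continue with detection on the original Exceptions using exactly the same calculation as in the query you posted.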
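The other idea raised in the question, subtracting each machine's own data point from its group's statistics, can also be sketched without a second pass over the raw data. It is worth doing for small groups in particular: with the sample standard deviation that stdev() computes, a single machine can never be more than (n - 1) / sqrt(n) standard deviations above the mean of a group that includes it, and that bound stays below 3 until the group has 11 machines. The sketch below keeps per-group sums and sums of squares, then removes the current machine's contribution before computing the cutoff. The datatable rows, the column names n, total, totalSq, othersAvg, othersStdev, and the rule that a group needs at least three machines are assumptions for illustration.

```kusto
// Hypothetical per-machine counts, standing in for:
// Exceptions | summarize counts = count() by environmentName, machineFunction, machineName
let ExceptionsCounts = datatable(environmentName:string, machineFunction:string, machineName:string, counts:long)
[
    "prod", "web", "m1", 12,
    "prod", "web", "m2", 9,
    "prod", "web", "m3", 11,
    "prod", "web", "m4", 250
];
let Sigmas = 3.0;
// Per-group totals from which any single machine can later be subtracted
let GroupSums = ExceptionsCounts
    | summarize n = count(), total = sum(counts), totalSq = sum(counts * counts)
        by environmentName, machineFunction;
ExceptionsCounts
| join kind=inner GroupSums on environmentName, machineFunction
// A sample stddev of the other machines needs at least two of them
| where n > 2
| extend othersN = n - 1
// Mean of the group with the current machine left out
| extend othersAvg = todouble(total - counts) / othersN
// Sample variance of the group with the current machine left out
| extend othersVar = (todouble(totalSq - counts * counts) - othersN * othersAvg * othersAvg) / (othersN - 1)
// Clamp tiny negative values caused by floating-point rounding
| extend othersStdev = sqrt(max_of(othersVar, 0.0))
| extend cutoff = othersAvg + Sigmas * othersStdev
| where counts > cutoff
| project environmentName, machineFunction, machineName, counts, othersAvg, othersStdev, cutoff
```

With these sample rows only m4 exceeds its cutoff: its peers average about 10.7 exceptions with a stddev of about 1.5, while for every other machine the 250 stays in the baseline and inflates the cutoff. When the other machines all have identical counts the stddev is 0, so any count above their average is flagged; a floor such as MinimumExceptionsToTrigger from the original function would still be needed on top of this.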