Finding downtime duration from server log table
I have a table (SQL Server 2008 R2) that keeps up/down logs for multiple servers. The servers are pinged periodically, and their status (up or down) is written to this table. Its structure is as follows:
CREATE TABLE StatusLog
(
    LogID        INT PRIMARY KEY,
    ServerID     INT,
    QueryDate    DATETIME,
    ServerStatus VARCHAR(50)
)
Sample data:
INSERT INTO StatusLog
VALUES
(1, '1724', '2016-04-16 09:28:00.000', 'up'),
(2, '1724', '2016-04-16 09:29:00.000', 'up'),
(3, '1724', '2016-04-16 09:30:00.000', 'down'),
(6, '1724', '2016-04-16 09:31:00.000', 'down'),
(8, '1724', '2016-04-16 09:32:00.000', 'down'),
(9, '1724', '2016-04-16 09:33:00.000', 'down'),
(17, '1724', '2016-04-16 09:33:40.000', 'up'),
(18, '1724', '2016-04-16 09:34:00.000', 'up')
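I am trying to find the total downtime of a particular server within a given time period.
In the data extract above, the status of server 1724 changes to "down" at 09:30:00 and back to "up" at 09:33:40, for a total downtime of 220 seconds.
My approach is:
- For each "down block", find the "down" records and set their QueryDate as the start time in a new column. This is fast.
- In another new column, find the first "up" record after the downtime start and set its QueryDate as the downtime end. This is fairly fast.
- However, do this only for the first "down" record in a down block and not for the other "down" records in that block; otherwise the same downtime is counted several times. To do that I need to look at row numbers, and this is where things get messy and slow.
- Finally, subtract the start from the end to get the downtime of that block.
- Sum all the downtimes to get the total downtime.
I wrote the script below, but it is very slow (each server has a few hundred thousand log records):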
DECLARE @StartDate DATE = '2016-04-01'
DECLARE @EndDate DATE = '2016-04-30'
DECLARE @ServerID INT = '1724'
;WITH CTE_StatusLog AS
(
    SELECT LogID, QueryDate, ServerStatus,
           ROW_NUMBER() OVER (ORDER BY QueryDate) AS RN
    FROM StatusLog
    WHERE ServerID = @ServerID
      AND QueryDate BETWEEN @StartDate AND @EndDate
)
SELECT LogID,
       QueryDate,
       ServerStatus,
       RN,
       DownStarted = CASE WHEN s1.ServerStatus = 'down'
                          THEN s1.QueryDate END,
       DownEnded = (SELECT TOP 1 QueryDate
                    FROM CTE_StatusLog AS s2
                    WHERE s2.QueryDate > s1.QueryDate
                      AND s1.ServerStatus = 'down'
                      AND s2.ServerStatus = 'up'
                      AND (SELECT s3.ServerStatus
                           FROM CTE_StatusLog AS s3
                           WHERE s3.RN = s1.RN-1) <> 'down'
                    ORDER BY s2.QueryDate),
       DownDuration = DATEDIFF(SECOND,
                               CASE WHEN s1.ServerStatus = 'down'
                                    THEN s1.QueryDate END,
                               (SELECT TOP 1 QueryDate
                                FROM CTE_StatusLog AS s2
                                WHERE s2.QueryDate > s1.QueryDate
                                  AND s1.ServerStatus = 'down'
                                  AND s2.ServerStatus = 'up'
                                  AND (SELECT s3.ServerStatus
                                       FROM CTE_StatusLog AS s3
                                       WHERE s3.RN = s1.RN-1) <> 'down'
                                ORDER BY s2.QueryDate))
FROM CTE_StatusLog AS s1
WHERE QueryDate BETWEEN @StartDate AND @EndDate
ORDER BY s1.RN
Output:
LogID  QueryDate                ServerStatus  RN  DownStarted              DownEnded                DownDuration
1      2016-04-16 09:28:00.000  up            1   NULL                     NULL                     NULL
2      2016-04-16 09:29:00.000  up            2   NULL                     NULL                     NULL
3      2016-04-16 09:30:00.000  down          3   2016-04-16 09:30:00.000  2016-04-16 09:33:40.000  220
6      2016-04-16 09:31:00.000  down          4   2016-04-16 09:31:00.000  NULL                     NULL
8      2016-04-16 09:32:00.000  down          5   2016-04-16 09:32:00.000  NULL                     NULL
9      2016-04-16 09:33:00.000  down          6   2016-04-16 09:33:00.000  NULL                     NULL
17     2016-04-16 09:33:40.000  up            7   NULL                     NULL                     NULL
18     2016-04-16 09:34:00.000  up            8   NULL                     NULL                     NULL
How can I improve this script, or is there a better way to calculate downtime from this table structure?
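If you only need the total downtime, you can decide what each row represents: assume each row stands for the number of seconds the server spent in that row's status since the previous check of that server. Then sum those rows per status. Note that this is an approximation, because each interval is credited to the status of the row that ends it; on the sample data it attributes 240 seconds to "down" rather than 220.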
DECLARE @StartDate DATE = '2016-04-01'
DECLARE @EndDate DATE = '2016-04-30'
DECLARE @ServerID INT = '1724'
SELECT
    individualRows.ServerId,
    individualRows.ServerStatus,
    SUM(secondsInState) AS TotalTime
FROM
    (SELECT
         StatusLog.ServerId,
         StatusLog.QueryDate,
         StatusLog.ServerStatus,
         DATEDIFF(SECOND, PreviousStatus.QueryDate, StatusLog.QueryDate) AS secondsInState
     FROM
         StatusLog
         LEFT OUTER JOIN
         StatusLog AS PreviousStatus
             ON StatusLog.ServerId = PreviousStatus.ServerId
            AND PreviousStatus.QueryDate < StatusLog.QueryDate
            AND PreviousStatus.QueryDate = (SELECT MAX(QueryDate)
                                            FROM StatusLog sl2
                                            WHERE sl2.ServerId = StatusLog.ServerId
                                              AND sl2.QueryDate < StatusLog.QueryDate)
     WHERE StatusLog.QueryDate > @StartDate
       AND StatusLog.QueryDate < @EndDate
       AND StatusLog.ServerId = @ServerID) AS individualRows
GROUP BY
    individualRows.ServerId,
    individualRows.ServerStatus
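If you really need the duration of each outage, I would probably try a temp table in which each row is joined to its previous row, keeping the rows whose previous row is in the opposite state, similar to your results. Then I would filter and aggregate that temp table.
In my experience, once a table holds a lot of rows, temp tables are much faster than CTEs.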
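A minimal sketch of that temp table idea, assuming ServerStatus only ever holds 'up' or 'down' and is never NULL. The names #Numbered, #Transitions, RN and TransitionNo are made up for illustration, and the end date is treated as an exclusive bound, which differs slightly from the BETWEEN in the question:

```sql
DECLARE @StartDate DATETIME = '2016-04-01';
DECLARE @EndDate   DATETIME = '2016-05-01';  -- exclusive upper bound (assumption)
DECLARE @ServerID  INT      = 1724;

-- Number the rows once and materialize them, so the self-joins below
-- seek on an indexed temp table instead of re-evaluating a CTE.
SELECT ROW_NUMBER() OVER (ORDER BY QueryDate, LogID) AS RN,
       QueryDate,
       ServerStatus
INTO   #Numbered
FROM   StatusLog
WHERE  ServerID = @ServerID
  AND  QueryDate >= @StartDate
  AND  QueryDate <  @EndDate;

CREATE UNIQUE CLUSTERED INDEX IX_Numbered_RN ON #Numbered (RN);

-- Keep only the first row and the rows whose status differs from the previous row.
SELECT ROW_NUMBER() OVER (ORDER BY cur.RN) AS TransitionNo,
       cur.QueryDate,
       cur.ServerStatus
INTO   #Transitions
FROM   #Numbered AS cur
LEFT JOIN #Numbered AS prev
       ON prev.RN = cur.RN - 1
WHERE  prev.RN IS NULL
   OR  prev.ServerStatus <> cur.ServerStatus;

CREATE UNIQUE CLUSTERED INDEX IX_Transitions_No ON #Transitions (TransitionNo);

-- With only two statuses, the transition after a 'down' is the 'up' that ends it.
SELECT d.QueryDate AS DownStarted,
       u.QueryDate AS DownEnded,
       DATEDIFF(SECOND, d.QueryDate, u.QueryDate) AS DownDuration
FROM   #Transitions AS d
LEFT JOIN #Transitions AS u
       ON u.TransitionNo = d.TransitionNo + 1
WHERE  d.ServerStatus = 'down';

DROP TABLE #Transitions;
DROP TABLE #Numbered;
```

On the sample data this returns a single row: down from 09:30:00 to 09:33:40, 220 seconds. An outage that is still open at the end of the window comes back with NULL in DownEnded and DownDuration. Wrapping the last SELECT in SUM(DownDuration) gives the total.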
I would approach this by getting the next "up" time for each "down" record. In SQL Server 2008, this uses OUTER APPLY:
select sl.*, slup.querydate as next_update,
       datediff(second, sl.querydate, slup.querydate) as down_in_seconds
from statuslog sl outer apply
     (select top 1 sl2.*
      from statuslog sl2
      where sl2.serverid = sl.serverid and
            sl2.querydate >= sl.querydate and
            sl2.serverstatus = 'up'
      order by sl2.querydate asc
     ) slup
where sl.serverstatus = 'down';
If you want to summarize by outage, I would use aggregation:
select serverid, min(querydate) as down_date, next_update,
       max(down_in_seconds) as down_in_seconds
from (select sl.*, slup.querydate as next_update,
             datediff(second, sl.querydate, slup.querydate) as down_in_seconds
      from statuslog sl outer apply
           (select top 1 sl2.*
            from statuslog sl2
            where sl2.serverid = sl.serverid and
                  sl2.querydate >= sl.querydate and
                  sl2.serverstatus = 'up'
            order by sl2.querydate asc
           ) slup
      where sl.serverstatus = 'down'
     ) slud
group by serverid, next_update;
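To get from the per-outage rows to the single total the question asks for, one option is to sum them for one server and period. This is only a sketch: the index name IX_StatusLog_Server_Date and the aliases outage, next_up and total_down_seconds are made up here, and the end date is treated as an exclusive bound.

```sql
-- Supports the TOP 1 lookup inside OUTER APPLY (index name is hypothetical).
CREATE INDEX IX_StatusLog_Server_Date
    ON StatusLog (ServerID, QueryDate) INCLUDE (ServerStatus);

DECLARE @StartDate DATETIME = '2016-04-01';
DECLARE @EndDate   DATETIME = '2016-05-01';  -- exclusive upper bound (assumption)
DECLARE @ServerID  INT      = 1724;

SELECT SUM(outage.down_in_seconds) AS total_down_seconds
FROM (
    -- One row per outage: every 'down' row of a block shares the same next 'up' row,
    -- and the longest difference belongs to the first 'down' row of that block.
    SELECT slup.QueryDate AS next_up,
           MAX(DATEDIFF(SECOND, sl.QueryDate, slup.QueryDate)) AS down_in_seconds
    FROM StatusLog sl
    OUTER APPLY
        (SELECT TOP 1 sl2.QueryDate
         FROM StatusLog sl2
         WHERE sl2.ServerID = sl.ServerID
           AND sl2.QueryDate > sl.QueryDate
           AND sl2.ServerStatus = 'up'
         ORDER BY sl2.QueryDate ASC
        ) slup
    WHERE sl.ServerStatus = 'down'
      AND sl.ServerID = @ServerID
      AND sl.QueryDate >= @StartDate
      AND sl.QueryDate <  @EndDate
    GROUP BY slup.QueryDate
) AS outage;
```

On the sample data this returns 220. An outage with no later 'up' row yields NULL seconds, which SUM ignores. Downtime is measured up to the next 'up' row even when that row falls after @EndDate, and an outage that began before @StartDate is counted only from its first 'down' row inside the window.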