聚合重叠段以测量有效长度
Aggregate Overlapping Segments to Measure Effective Length
我有一个 road_events
table:
create table road_events (
event_id number(4,0),
road_id number(4,0),
year number(4,0),
from_meas number(10,2),
to_meas number(10,2),
total_road_length number(10,2)
);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (1,1,2020,25,50,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (2,1,2000,25,50,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (3,1,1980,0,25,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (4,1,1960,75,100,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (5,1,1940,1,100,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (6,2,2000,10,30,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (7,2,1975,30,60,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (8,2,1950,50,90,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (9,3,2050,40,90,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (10,4,2040,0,200,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (11,4,2013,0,199,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (12,4,2001,0,200,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (13,5,1985,50,70,300);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (14,5,1985,10,50,300);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (15,5,1965,1,301,300);
commit;
select * from road_events;
EVENT_ID ROAD_ID YEAR FROM_MEAS TO_MEAS TOTAL_ROAD_LENGTH
---------- ---------- ---------- ---------- ---------- -----------------
1 1 2020 25 50 100
2 1 2000 25 50 100
3 1 1980 0 25 100
4 1 1960 75 100 100
5 1 1940 1 100 100
6 2 2000 10 30 100
7 2 1975 30 60 100
8 2 1950 50 90 100
9 3 2050 40 90 100
10 4 2040 0 200 200
11 4 2013 0 199 200
12 4 2001 0 200 200
13 5 1985 50 70 300
14 5 1985 10 50 300
15 5 1965 1 301 300
我想select表示每条道路上最近工作的事件。
这是一个棘手的操作,因为事件通常只涉及道路的一部分。这意味着我不能简单地 select 每条道路的最新事件;我只需要 select 最近 不重叠的事件里程 。
可能的逻辑(按顺序):
我不愿意猜测这个问题是如何解决的,因为它最终可能弊大于利(有点像 XY Problem)。另一方面,它可能会提供对问题本质的洞察,所以这里是:
- Select 每条道路的最近事件。我们将调用最近的事件:
event A
.
- 如果
event A
是 >= total_road_length
,那么这就是我所需要的。算法到此结束。
- 否则,获取与
event A
具有不同范围的下一个按时间顺序排列的事件 (event B
)。
- 如果
event B
的范围与 event A
的范围重叠,则只获取 event B
不重叠的部分。
- 重复步骤 3 和 4,直到事件总长度为
= total_road_length
。或者当这条路没有更多事件时停止。
问题:
我知道这是一项艰巨的任务,但是 要做到这一点需要什么?
这是一个经典的线性参考问题。如果我可以将线性引用操作作为查询的一部分,那将非常有帮助。
结果将是:
EVENT_ID ROAD_ID YEAR TOTAL_ROAD_LENGTH EVENT_LENGTH
---------- ---------- ---------- ----------------- ------------
1 1 2020 100 25
3 1 1980 100 25
4 1 1960 100 25
5 1 1940 100 25
6 2 2000 100 20
7 2 1975 100 30
8 2 1950 100 30
9 3 2050 100 50
10 4 2040 200 200
13 5 1985 300 20
14 5 1985 300 40
15 5 1965 300 240
相关问题:
今天想太多了,但是我现在有一些东西忽略了+/- 10米。
首先创建了一个函数,将成对作为字符串接收/从成对接收,并且 returns 字符串中成对所覆盖的距离。例如 '10:20;35:45' returns 20.
CREATE
OR replace FUNCTION get_distance_range_str (strRangeStr VARCHAR2)
RETURN NUMBER IS intRetNum NUMBER;
BEGIN
--split input string
WITH cte_1
AS (
SELECT regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) AS TO_FROM_STRING
FROM dual connect BY regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) IS NOT NULL
)
--split From/To pairs
,cte_2
AS (
SELECT cte_1.TO_FROM_STRING
,to_number(substr(cte_1.TO_FROM_STRING, 1, instr(cte_1.TO_FROM_STRING, ':') - 1)) AS FROM_MEAS
,to_number(substr(cte_1.TO_FROM_STRING, instr(cte_1.TO_FROM_STRING, ':') + 1, length(cte_1.TO_FROM_STRING) - instr(cte_1.TO_FROM_STRING, ':'))) AS TO_MEAS
FROM cte_1
)
--merge ranges
,cte_merge_ranges
AS (
SELECT s1.FROM_MEAS
,
--t1.TO_MEAS
MIN(t1.TO_MEAS) AS TO_MEAS
FROM cte_2 s1
INNER JOIN cte_2 t1 ON s1.FROM_MEAS <= t1.TO_MEAS
AND NOT EXISTS (
SELECT *
FROM cte_2 t2
WHERE t1.TO_MEAS >= t2.FROM_MEAS
AND t1.TO_MEAS < t2.TO_MEAS
)
WHERE NOT EXISTS (
SELECT *
FROM cte_2 s2
WHERE s1.FROM_MEAS > s2.FROM_MEAS
AND s1.FROM_MEAS <= s2.TO_MEAS
)
GROUP BY s1.FROM_MEAS
)
SELECT sum(TO_MEAS - FROM_MEAS) AS DISTANCE_COVERED
INTO intRetNum
FROM cte_merge_ranges;
RETURN intRetNum;
END;
然后编写此查询,为适当的先验范围构建该函数的字符串。无法将窗口与 list_agg 一起使用,但能够通过相关子查询实现同样的效果。
--use list agg to create list of to/from pairs for rows before current row in the ordering
WITH cte_2
AS (
SELECT T1.*
,(
SELECT LISTAGG(FROM_MEAS || ':' || TO_MEAS || ';') WITHIN
GROUP (
ORDER BY ORDER BY YEAR DESC, EVENT_ID DESC
)
FROM road_events T2
WHERE T1.YEAR || lpad(T1.EVENT_ID, 10,'0') <
T2.YEAR || lpad(T2.EVENT_ID, 10,'0')
AND T1.ROAD_ID = T2.ROAD_ID
GROUP BY road_id
) AS PRIOR_RANGES_STR
FROM road_events T1
)
--get distance for prior range string - distance ignoring current row
--get distance including current row
,cte_3
AS (
SELECT cte_2.*
,coalesce(get_distance_range_str(PRIOR_RANGES_STR), 0) AS DIST_PRIOR
,get_distance_range_str(PRIOR_RANGES_STR || FROM_MEAS || ':' || TO_MEAS || ';') AS DIST_NOW
FROM cte_2 cte_2
)
--distance including current row less distance ignoring current row is distance added to the range this row
,cte_4
AS (
SELECT cte_3.*
,DIST_NOW - DIST_PRIOR AS DIST_ADDED_THIS_ROW
FROM cte_3
)
SELECT *
FROM cte_4
--filter out any rows with distance added as 0
WHERE DIST_ADDED_THIS_ROW > 0
ORDER BY ROAD_ID, YEAR DESC, EVENT_ID DESC
sqlfiddle 在这里:http://sqlfiddle.com/#!4/81331/36
在我看来,结果与您的相符。我在最终查询中保留了额外的列以尝试说明每个步骤。
处理测试用例 - 可能需要一些工作来处理更大数据集中的所有可能性,但我认为这是开始和改进的好地方。
重叠范围合并的功劳是这里的第一个答案:Merge overlapping date intervals
带窗口的 list_agg 的功劳是这里的第一个答案:
LISTAGG equivalent with windowing clause
我的主要 DBMS 是 Teradata,但这也可以在 Oracle 中按原样运行。
WITH all_meas AS
( -- get a distinct list of all from/to points
SELECT road_id, from_meas AS meas
FROM road_events
UNION
SELECT road_id, to_meas
FROM road_events
)
-- select * from all_meas order by 1,2
, all_ranges AS
( -- create from/to ranges
SELECT road_id, meas AS from_meas
,Lead(meas)
Over (PARTITION BY road_id
ORDER BY meas) AS to_meas
FROM all_meas
)
-- SELECT * from all_ranges order by 1,2
, all_event_ranges AS
( -- now match the ranges to the event ranges
SELECT
ar.*
,re.event_id
,re.year
,re.total_road_length
,ar.to_meas - ar.from_meas AS event_length
-- used to filter the latest event as multiple events might cover the same range
,Row_Number()
Over (PARTITION BY ar.road_id, ar.from_meas
ORDER BY year DESC) AS rn
FROM all_ranges ar
JOIN road_events re
ON ar.road_id = re.road_id
AND ar.from_meas < re.to_meas
AND ar.to_meas > re.from_meas
WHERE ar.to_meas IS NOT NULL
)
SELECT event_id, road_id, year, total_road_length, Sum(event_length)
FROM all_event_ranges
WHERE rn = 1 -- latest year only
GROUP BY event_id, road_id, year, total_road_length
ORDER BY road_id, year DESC;
如果您需要 return 实际覆盖 from/to_meas
(如您在编辑前的问题),可能会更复杂。第一部分是相同的,但如果没有聚合,查询可以 return 具有相同 event_id 的相邻行(例如,对于事件 3:0-1 和 1-25):
SELECT * FROM all_event_ranges
WHERE rn = 1
ORDER BY road_id, from_meas;
如果你想合并相邻的行,你还需要两个步骤(使用标准方法,标记组的第一行并计算组号):
WITH all_meas AS
(
SELECT road_id, from_meas AS meas
FROM road_events
UNION
SELECT road_id, to_meas
FROM road_events
)
-- select * from all_meas order by 1,2
, all_ranges AS
(
SELECT road_id, meas AS from_meas
,Lead(meas)
Over (PARTITION BY road_id
ORDER BY meas) AS to_meas
FROM all_meas
)
-- SELECT * from all_ranges order by 1,2
, all_event_ranges AS
(
SELECT
ar.*
,re.event_id
,re.year
,re.total_road_length
,ar.to_meas - ar.from_meas AS event_length
,Row_Number()
Over (PARTITION BY ar.road_id, ar.from_meas
ORDER BY year DESC) AS rn
FROM all_ranges ar
JOIN road_events re
ON ar.road_id = re.road_id
AND ar.from_meas < re.to_meas
AND ar.to_meas > re.from_meas
WHERE ar.to_meas IS NOT NULL
)
-- SELECT * FROM all_event_ranges WHERE rn = 1 ORDER BY road_id, from_meas
, adjacent_events AS
( -- assign 1 to the 1st row of an event
SELECT t.*
,CASE WHEN Lag(event_id)
Over(PARTITION BY road_id
ORDER BY from_meas) = event_id
THEN 0
ELSE 1
END AS flag
FROM all_event_ranges t
WHERE rn = 1
)
-- SELECT * FROM adjacent_events ORDER BY road_id, from_meas
, grouped_events AS
( -- assign a groupnumber to adjacent rows using a Cumulative Sum over 0/1
SELECT t.*
,Sum(flag)
Over (PARTITION BY road_id
ORDER BY from_meas
ROWS Unbounded Preceding) AS grp
FROM adjacent_events t
)
-- SELECT * FROM grouped_events ORDER BY road_id, from_meas
SELECT event_id, road_id, year, Min(from_meas), Max(to_meas), total_road_length, Sum(event_length)
FROM grouped_events
GROUP BY event_id, road_id, grp, year, total_road_length
ORDER BY 2, Min(from_meas);
编辑:
ups,我刚刚发现一个博客 Overlapping ranges with priority 做完全相同的一些简化的 Oracle 语法。事实上,我将查询从 Teradata 中的其他一些简化语法翻译为 Standard/Oracle SQL :-)
我对你的 "road events" 有疑问,因为你没有描述什么是 1st meas
,我认为它是 0 和 1 之间的周期,没有 1。
所以,你可以用一个查询来计算:
with newest_MEAS as (
select ROAD_ID, MEAS.m, max(year) y
from road_events
join (select rownum -1 m
from dual
connect by rownum -1 <= (select max(TOTAL_ROAD_LENGTH) from road_events) ) MEAS
on MEAS.m between FROM_MEAS and TO_MEAS
group by ROAD_ID, MEAS.m )
select re.event_id, nm.ROAD_ID, re.total_road_length, nm.y, count(nm.m) EVENT_LENGTH
from newest_MEAS nm
join road_events re
on nm.ROAD_ID = re.ROAD_ID
and nm.m between re.from_meas and re.to_meas -1
and nm.y = re.year
group by re.event_id, nm.ROAD_ID, re.total_road_length, nm.y
order by event_id
还有另一种计算方法,使用 from 和 to 值:
with
part_begin_point as (
Select distinct road_id, from_meas point
from road_events be
union
Select distinct road_id, to_meas point
from road_events ee
)
, newest_part as (
select e.event_id
, e.road_id
, e.year
, e.total_road_length
, p.point
, LAG(e.event_id) over (partition by p.road_id order by p.point) prev_event
, e.to_meas event_to_meas
from part_begin_point p
join road_events e
on p.road_id = e.road_id
and p.point >= e.from_meas and p.point < e.to_meas
and not exists(
select 1 from road_events ne
where e.road_id = ne.road_id
and p.point >= ne.from_meas and p.point < ne.to_meas
and (e.year < ne.year or e.year = ne.year and e.event_id < ne.event_id))
)
select event_id, road_id, year
, point from_meas
, LEAD(point, 1, event_to_meas) over (partition by road_id order by point) to_meas
, total_road_length
, LEAD(point, 1, event_to_meas) over (partition by road_id order by point) - point EVENT_LENGTH
from newest_part
where 1=1
and event_id <> prev_event or prev_event is null
order by event_id, point
这个发现扩展了 table 以针对每条道路的每一英里生成一行,并且只需要 MAX
年。然后我们可以 COUNT
行数来生成 event_length.
它生成的 table 与您在上面指定的完全一样。
注意:我 运行 此查询针对 SQL 服务器。我认为您可以在 Oracle 中使用 LEAST
而不是 SELECT MIN(event_length) FROM (VALUES...)
。
WITH NumberRange(result) AS
(
SELECT 0
UNION ALL
SELECT result + 1
FROM NumberRange
WHERE result < 301 --Max length of any road
),
CurrentRoadEventLength(road_id, [year], event_length) AS
(
SELECT road_id, [year], COUNT(*) AS event_length
FROM (
SELECT re.road_id, n.result, MAX(re.[year]) as [year]
FROM road_events re INNER JOIN NumberRange n
ON ( re.from_meas <= n.result
AND re.to_meas > n.result
)
GROUP BY re.road_id, n.result
) events_per_mile
GROUP BY road_id, [year]
)
SELECT re.event_id, re.road_id, re.[year], re.total_road_length,
(SELECT MIN(event_length) FROM (VALUES (re.to_meas - re.from_meas), (cre.event_length)) AS EventLengths(event_length))
FROM road_events re INNER JOIN CurrentRoadEventLength cre
ON ( re.road_id = cre.road_id
AND re.[year] = cre.[year]
)
ORDER BY re.event_id, re.road_id
OPTION (MAXRECURSION 301) --Max length of any road
解法:
SELECT RE.road_id, RE.event_id, RE.year, RE.from_meas, RE.to_meas, RE.road_length, RE.event_length, RE.used_length, RE.leftover_length
FROM
(
SELECT RE.C_road_id[road_id], RE.C_event_id[event_id], RE.C_year[year], RE.C_from_meas[from_meas], RE.C_to_meas[to_meas], RE.C_road_length[road_length],
RE.event_length, RE.used_length, (RE.event_length - (CASE WHEN RE.HasOverlap = 1 THEN RE.used_length ELSE 0 END))[leftover_length]
FROM
(
SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
(CASE WHEN MAX(RE.A_event_id) IS NOT NULL THEN 1 ELSE 0 END)[HasOverlap],
(RE.C_to_meas - RE.C_from_meas)[event_length],
SUM( (CASE WHEN RE.O_to_meas <= RE.C_to_meas THEN RE.O_to_meas ELSE RE.C_to_meas END)
- (CASE WHEN RE.O_from_meas >= RE.C_from_meas THEN RE.O_from_meas ELSE RE.C_from_meas END)
)[used_length]--This is the length that is already being counted towards later years.
FROM
(
SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
RE.A_event_id, MIN(RE.O_from_meas)[O_from_meas], MAX(RE.O_to_meas)[O_to_meas]
FROM
(
SELECT RE_C.road_id[C_road_id], RE_C.event_id[C_event_id], RE_C.year[C_year], RE_C.from_meas[C_from_meas], RE_C.to_meas[C_to_meas], RE_C.total_road_length[C_road_length],
RE_A.road_id[A_road_id], RE_A.event_id[A_event_id], RE_A.year[A_year], RE_A.from_meas[A_from_meas], RE_A.to_meas[A_to_meas], RE_A.total_road_length[A_road_length],
RE_O.road_id[O_road_id], RE_O.event_id[O_event_id], RE_O.year[O_year], RE_O.from_meas[O_from_meas], RE_O.to_meas[O_to_meas], RE_O.total_road_length[O_road_length],
(ROW_NUMBER() OVER (PARTITION BY RE_C.road_id, RE_C.event_id, RE_O.event_id ORDER BY RE_S.Overlap DESC, RE_A.event_id))[RowNum]--Use to Group Overlaps into Swaths.
FROM road_events as RE_C--Current.
LEFT JOIN road_events as RE_A--After. --Use a Left-Join to capture when there is only 1 Event (or it is the Last-Event in the list).
ON RE_A.road_id = RE_C.road_id
AND RE_A.event_id != RE_C.event_id--Not the same EventID.
AND RE_A.year >= RE_C.year--Occured on or After the Current Event.
AND ( (RE_A.from_meas >= RE_C.from_meas AND RE_A.from_meas <= RE_C.to_meas)--There is Overlap.
OR (RE_A.to_meas >= RE_C.from_meas AND RE_A.to_meas <= RE_C.to_meas)--There is Overlap.
OR (RE_A.to_meas = RE_C.to_meas AND RE_A.from_meas = RE_C.from_meas)--They are Equal.
)
LEFT JOIN road_events as RE_O--Overlapped/Linked.
ON RE_O.road_id = RE_C.road_id
AND RE_O.event_id != RE_C.event_id--Not the same EventID.
AND RE_O.year >= RE_C.year--Occured on or After the Current Event.
AND ( (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas >= RE_A.from_meas AND RE_O.to_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas = RE_A.to_meas AND RE_O.from_meas = RE_A.from_meas)--They are Equal.
)
OUTER APPLY
(
SELECT COUNT(*)[Overlap]
FROM road_events as RE_O--Overlapped/Linked.
WHERE RE_O.road_id = RE_C.road_id
AND RE_O.event_id != RE_C.event_id--Not the same EventID.
AND RE_O.year >= RE_C.year--Occured on or After the Current Event.
AND ( (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas >= RE_A.from_meas AND RE_O.to_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas = RE_A.to_meas AND RE_O.from_meas = RE_A.from_meas)--They are Equal.
)
) AS RE_S--Swath of Overlaps.
) AS RE
WHERE RE.RowNum = 1--Remove Duplicates and Select those that are in the biggest Swaths.
GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
RE.A_event_id
) AS RE
GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length
) AS RE
) AS RE
WHERE RE.leftover_length > 0--Filter out Events that had their entire Segments overlapped by a Later Event(s).
ORDER BY RE.road_id, RE.year DESC, RE.event_id
SQL Fiddle:
http://sqlfiddle.com/#!18/2880b/1
已添加Rules/Assumptions/Clarifications:
1.) 允许 event_id
和 road_id
可能是 Guid 的或创建的 out-of-order,
所以不要假设更高或更低的值赋予记录关系意义。
例如:
ID 为 1 和 ID 为 2 不保证 ID 2 是最新的(并且 vice-versa)。
这样解决方案将更通用,更少 "hacky".
2.) 过滤掉其整个片段被后续事件重叠的事件。
例如:
如果 2008 年的工作时间为 20-50,而 2009 年的工作时间为 10-60,
那么 2008 年的事件将被过滤掉,因为它的整个段在 2009 年被重新散列。
附加测试数据:
为确保解决方案不仅仅针对给定的数据集,
我在原来的DataSet中加了一个6的road_id
,为了多打几个fringe-cases.
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (16,6,2012,0,100,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (17,6,2013,68,69,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (18,6,2014,65,66,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (19,6,2015,62,63,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (20,6,2016,50,60,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (21,6,2017,30,40,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (22,6,2017,20,55,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (23,6,2018,0,25,100);
结果:(我在 Green[=112] 中添加的 8 附加记录=])
数据库版本:
此解决方案是 Oracle 并且 SQL-Server 不可知论者:
它应该在 SS2008+ 和 Oracle 12c+.
中工作
这个问题被标记为Oracle 12c,但是没有online-fiddle我可以不注册就用,
所以我在 SQL-Server 中对其进行了测试 - 但相同的语法应该适用于两者。
我的大部分查询都依赖 Cross-Apply
和 Outer-Apply
。
Oracle 在 12c 中引入了这些 "Joins":
https://oracle-base.com/articles/12c/lateral-inline-views-cross-apply-and-outer-apply-joins-12cr1
简化且高效:
这使用:
• 无相关子查询。
• 无递归。
• 无 CTE。
• 没有工会。
• 无用户函数。
索引:
我读了你问过的关于索引的评论之一。
我会为您要搜索和分组的每个主要字段添加 1 列索引:
road_id
、event_id
和 year
。
你可以看看这个索引是否对你有帮助(我不知道你打算如何使用这些数据):
关键字段: road_id
、event_id
、year
包括:from_meas
、to_meas
标题:
您可能需要考虑将此问题的标题重命名为更易于搜索的内容,例如:
“聚合重叠段以测量有效长度”。
这将使解决方案更容易找到,以帮助其他有类似问题的人。
其他想法:
像这样的东西在计算 Overall-Time 花费在某事
上会很有用
具有重叠的开始和停止时间戳。
我有一个 road_events
table:
create table road_events (
event_id number(4,0),
road_id number(4,0),
year number(4,0),
from_meas number(10,2),
to_meas number(10,2),
total_road_length number(10,2)
);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (1,1,2020,25,50,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (2,1,2000,25,50,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (3,1,1980,0,25,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (4,1,1960,75,100,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (5,1,1940,1,100,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (6,2,2000,10,30,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (7,2,1975,30,60,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (8,2,1950,50,90,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (9,3,2050,40,90,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (10,4,2040,0,200,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (11,4,2013,0,199,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (12,4,2001,0,200,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (13,5,1985,50,70,300);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (14,5,1985,10,50,300);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (15,5,1965,1,301,300);
commit;
select * from road_events;
EVENT_ID ROAD_ID YEAR FROM_MEAS TO_MEAS TOTAL_ROAD_LENGTH
---------- ---------- ---------- ---------- ---------- -----------------
1 1 2020 25 50 100
2 1 2000 25 50 100
3 1 1980 0 25 100
4 1 1960 75 100 100
5 1 1940 1 100 100
6 2 2000 10 30 100
7 2 1975 30 60 100
8 2 1950 50 90 100
9 3 2050 40 90 100
10 4 2040 0 200 200
11 4 2013 0 199 200
12 4 2001 0 200 200
13 5 1985 50 70 300
14 5 1985 10 50 300
15 5 1965 1 301 300
我想select表示每条道路上最近工作的事件。
这是一个棘手的操作,因为事件通常只涉及道路的一部分。这意味着我不能简单地 select 每条道路的最新事件;我只需要 select 最近 不重叠的事件里程 。
可能的逻辑(按顺序):
我不愿意猜测这个问题是如何解决的,因为它最终可能弊大于利(有点像 XY Problem)。另一方面,它可能会提供对问题本质的洞察,所以这里是:
- Select 每条道路的最近事件。我们将调用最近的事件:
event A
. - 如果
event A
是>= total_road_length
,那么这就是我所需要的。算法到此结束。 - 否则,获取与
event A
具有不同范围的下一个按时间顺序排列的事件 (event B
)。 - 如果
event B
的范围与event A
的范围重叠,则只获取event B
不重叠的部分。 - 重复步骤 3 和 4,直到事件总长度为
= total_road_length
。或者当这条路没有更多事件时停止。
问题:
我知道这是一项艰巨的任务,但是 要做到这一点需要什么?
这是一个经典的线性参考问题。如果我可以将线性引用操作作为查询的一部分,那将非常有帮助。
结果将是:
EVENT_ID ROAD_ID YEAR TOTAL_ROAD_LENGTH EVENT_LENGTH
---------- ---------- ---------- ----------------- ------------
1 1 2020 100 25
3 1 1980 100 25
4 1 1960 100 25
5 1 1940 100 25
6 2 2000 100 20
7 2 1975 100 30
8 2 1950 100 30
9 3 2050 100 50
10 4 2040 200 200
13 5 1985 300 20
14 5 1985 300 40
15 5 1965 300 240
相关问题:
今天想太多了,但是我现在有一些东西忽略了+/- 10米。
首先创建了一个函数,将成对作为字符串接收/从成对接收,并且 returns 字符串中成对所覆盖的距离。例如 '10:20;35:45' returns 20.
CREATE
OR replace FUNCTION get_distance_range_str (strRangeStr VARCHAR2)
RETURN NUMBER IS intRetNum NUMBER;
BEGIN
--split input string
WITH cte_1
AS (
SELECT regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) AS TO_FROM_STRING
FROM dual connect BY regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) IS NOT NULL
)
--split From/To pairs
,cte_2
AS (
SELECT cte_1.TO_FROM_STRING
,to_number(substr(cte_1.TO_FROM_STRING, 1, instr(cte_1.TO_FROM_STRING, ':') - 1)) AS FROM_MEAS
,to_number(substr(cte_1.TO_FROM_STRING, instr(cte_1.TO_FROM_STRING, ':') + 1, length(cte_1.TO_FROM_STRING) - instr(cte_1.TO_FROM_STRING, ':'))) AS TO_MEAS
FROM cte_1
)
--merge ranges
,cte_merge_ranges
AS (
SELECT s1.FROM_MEAS
,
--t1.TO_MEAS
MIN(t1.TO_MEAS) AS TO_MEAS
FROM cte_2 s1
INNER JOIN cte_2 t1 ON s1.FROM_MEAS <= t1.TO_MEAS
AND NOT EXISTS (
SELECT *
FROM cte_2 t2
WHERE t1.TO_MEAS >= t2.FROM_MEAS
AND t1.TO_MEAS < t2.TO_MEAS
)
WHERE NOT EXISTS (
SELECT *
FROM cte_2 s2
WHERE s1.FROM_MEAS > s2.FROM_MEAS
AND s1.FROM_MEAS <= s2.TO_MEAS
)
GROUP BY s1.FROM_MEAS
)
SELECT sum(TO_MEAS - FROM_MEAS) AS DISTANCE_COVERED
INTO intRetNum
FROM cte_merge_ranges;
RETURN intRetNum;
END;
然后编写此查询,为适当的先验范围构建该函数的字符串。无法将窗口与 list_agg 一起使用,但能够通过相关子查询实现同样的效果。
--use list agg to create list of to/from pairs for rows before current row in the ordering
WITH cte_2
AS (
SELECT T1.*
,(
SELECT LISTAGG(FROM_MEAS || ':' || TO_MEAS || ';') WITHIN
GROUP (
ORDER BY ORDER BY YEAR DESC, EVENT_ID DESC
)
FROM road_events T2
WHERE T1.YEAR || lpad(T1.EVENT_ID, 10,'0') <
T2.YEAR || lpad(T2.EVENT_ID, 10,'0')
AND T1.ROAD_ID = T2.ROAD_ID
GROUP BY road_id
) AS PRIOR_RANGES_STR
FROM road_events T1
)
--get distance for prior range string - distance ignoring current row
--get distance including current row
,cte_3
AS (
SELECT cte_2.*
,coalesce(get_distance_range_str(PRIOR_RANGES_STR), 0) AS DIST_PRIOR
,get_distance_range_str(PRIOR_RANGES_STR || FROM_MEAS || ':' || TO_MEAS || ';') AS DIST_NOW
FROM cte_2 cte_2
)
--distance including current row less distance ignoring current row is distance added to the range this row
,cte_4
AS (
SELECT cte_3.*
,DIST_NOW - DIST_PRIOR AS DIST_ADDED_THIS_ROW
FROM cte_3
)
SELECT *
FROM cte_4
--filter out any rows with distance added as 0
WHERE DIST_ADDED_THIS_ROW > 0
ORDER BY ROAD_ID, YEAR DESC, EVENT_ID DESC
sqlfiddle 在这里:http://sqlfiddle.com/#!4/81331/36
在我看来,结果与您的相符。我在最终查询中保留了额外的列以尝试说明每个步骤。
处理测试用例 - 可能需要一些工作来处理更大数据集中的所有可能性,但我认为这是开始和改进的好地方。
重叠范围合并的功劳是这里的第一个答案:Merge overlapping date intervals
带窗口的 list_agg 的功劳是这里的第一个答案: LISTAGG equivalent with windowing clause
我的主要 DBMS 是 Teradata,但这也可以在 Oracle 中按原样运行。
WITH all_meas AS
( -- get a distinct list of all from/to points
SELECT road_id, from_meas AS meas
FROM road_events
UNION
SELECT road_id, to_meas
FROM road_events
)
-- select * from all_meas order by 1,2
, all_ranges AS
( -- create from/to ranges
SELECT road_id, meas AS from_meas
,Lead(meas)
Over (PARTITION BY road_id
ORDER BY meas) AS to_meas
FROM all_meas
)
-- SELECT * from all_ranges order by 1,2
, all_event_ranges AS
( -- now match the ranges to the event ranges
SELECT
ar.*
,re.event_id
,re.year
,re.total_road_length
,ar.to_meas - ar.from_meas AS event_length
-- used to filter the latest event as multiple events might cover the same range
,Row_Number()
Over (PARTITION BY ar.road_id, ar.from_meas
ORDER BY year DESC) AS rn
FROM all_ranges ar
JOIN road_events re
ON ar.road_id = re.road_id
AND ar.from_meas < re.to_meas
AND ar.to_meas > re.from_meas
WHERE ar.to_meas IS NOT NULL
)
SELECT event_id, road_id, year, total_road_length, Sum(event_length)
FROM all_event_ranges
WHERE rn = 1 -- latest year only
GROUP BY event_id, road_id, year, total_road_length
ORDER BY road_id, year DESC;
如果您需要 return 实际覆盖 from/to_meas
(如您在编辑前的问题),可能会更复杂。第一部分是相同的,但如果没有聚合,查询可以 return 具有相同 event_id 的相邻行(例如,对于事件 3:0-1 和 1-25):
SELECT * FROM all_event_ranges
WHERE rn = 1
ORDER BY road_id, from_meas;
如果你想合并相邻的行,你还需要两个步骤(使用标准方法,标记组的第一行并计算组号):
WITH all_meas AS
(
SELECT road_id, from_meas AS meas
FROM road_events
UNION
SELECT road_id, to_meas
FROM road_events
)
-- select * from all_meas order by 1,2
, all_ranges AS
(
SELECT road_id, meas AS from_meas
,Lead(meas)
Over (PARTITION BY road_id
ORDER BY meas) AS to_meas
FROM all_meas
)
-- SELECT * from all_ranges order by 1,2
, all_event_ranges AS
(
SELECT
ar.*
,re.event_id
,re.year
,re.total_road_length
,ar.to_meas - ar.from_meas AS event_length
,Row_Number()
Over (PARTITION BY ar.road_id, ar.from_meas
ORDER BY year DESC) AS rn
FROM all_ranges ar
JOIN road_events re
ON ar.road_id = re.road_id
AND ar.from_meas < re.to_meas
AND ar.to_meas > re.from_meas
WHERE ar.to_meas IS NOT NULL
)
-- SELECT * FROM all_event_ranges WHERE rn = 1 ORDER BY road_id, from_meas
, adjacent_events AS
( -- assign 1 to the 1st row of an event
SELECT t.*
,CASE WHEN Lag(event_id)
Over(PARTITION BY road_id
ORDER BY from_meas) = event_id
THEN 0
ELSE 1
END AS flag
FROM all_event_ranges t
WHERE rn = 1
)
-- SELECT * FROM adjacent_events ORDER BY road_id, from_meas
, grouped_events AS
( -- assign a groupnumber to adjacent rows using a Cumulative Sum over 0/1
SELECT t.*
,Sum(flag)
Over (PARTITION BY road_id
ORDER BY from_meas
ROWS Unbounded Preceding) AS grp
FROM adjacent_events t
)
-- SELECT * FROM grouped_events ORDER BY road_id, from_meas
SELECT event_id, road_id, year, Min(from_meas), Max(to_meas), total_road_length, Sum(event_length)
FROM grouped_events
GROUP BY event_id, road_id, grp, year, total_road_length
ORDER BY 2, Min(from_meas);
编辑:
ups,我刚刚发现一个博客 Overlapping ranges with priority 做完全相同的一些简化的 Oracle 语法。事实上,我将查询从 Teradata 中的其他一些简化语法翻译为 Standard/Oracle SQL :-)
我对你的 "road events" 有疑问,因为你没有描述什么是 1st meas
,我认为它是 0 和 1 之间的周期,没有 1。
所以,你可以用一个查询来计算:
with newest_MEAS as (
select ROAD_ID, MEAS.m, max(year) y
from road_events
join (select rownum -1 m
from dual
connect by rownum -1 <= (select max(TOTAL_ROAD_LENGTH) from road_events) ) MEAS
on MEAS.m between FROM_MEAS and TO_MEAS
group by ROAD_ID, MEAS.m )
select re.event_id, nm.ROAD_ID, re.total_road_length, nm.y, count(nm.m) EVENT_LENGTH
from newest_MEAS nm
join road_events re
on nm.ROAD_ID = re.ROAD_ID
and nm.m between re.from_meas and re.to_meas -1
and nm.y = re.year
group by re.event_id, nm.ROAD_ID, re.total_road_length, nm.y
order by event_id
还有另一种计算方法,使用 from 和 to 值:
with
part_begin_point as (
Select distinct road_id, from_meas point
from road_events be
union
Select distinct road_id, to_meas point
from road_events ee
)
, newest_part as (
select e.event_id
, e.road_id
, e.year
, e.total_road_length
, p.point
, LAG(e.event_id) over (partition by p.road_id order by p.point) prev_event
, e.to_meas event_to_meas
from part_begin_point p
join road_events e
on p.road_id = e.road_id
and p.point >= e.from_meas and p.point < e.to_meas
and not exists(
select 1 from road_events ne
where e.road_id = ne.road_id
and p.point >= ne.from_meas and p.point < ne.to_meas
and (e.year < ne.year or e.year = ne.year and e.event_id < ne.event_id))
)
select event_id, road_id, year
, point from_meas
, LEAD(point, 1, event_to_meas) over (partition by road_id order by point) to_meas
, total_road_length
, LEAD(point, 1, event_to_meas) over (partition by road_id order by point) - point EVENT_LENGTH
from newest_part
where 1=1
and event_id <> prev_event or prev_event is null
order by event_id, point
这个发现扩展了 table 以针对每条道路的每一英里生成一行,并且只需要 MAX
年。然后我们可以 COUNT
行数来生成 event_length.
它生成的 table 与您在上面指定的完全一样。
注意:我 运行 此查询针对 SQL 服务器。我认为您可以在 Oracle 中使用 LEAST
而不是 SELECT MIN(event_length) FROM (VALUES...)
。
WITH NumberRange(result) AS
(
SELECT 0
UNION ALL
SELECT result + 1
FROM NumberRange
WHERE result < 301 --Max length of any road
),
CurrentRoadEventLength(road_id, [year], event_length) AS
(
SELECT road_id, [year], COUNT(*) AS event_length
FROM (
SELECT re.road_id, n.result, MAX(re.[year]) as [year]
FROM road_events re INNER JOIN NumberRange n
ON ( re.from_meas <= n.result
AND re.to_meas > n.result
)
GROUP BY re.road_id, n.result
) events_per_mile
GROUP BY road_id, [year]
)
SELECT re.event_id, re.road_id, re.[year], re.total_road_length,
(SELECT MIN(event_length) FROM (VALUES (re.to_meas - re.from_meas), (cre.event_length)) AS EventLengths(event_length))
FROM road_events re INNER JOIN CurrentRoadEventLength cre
ON ( re.road_id = cre.road_id
AND re.[year] = cre.[year]
)
ORDER BY re.event_id, re.road_id
OPTION (MAXRECURSION 301) --Max length of any road
解法:
SELECT RE.road_id, RE.event_id, RE.year, RE.from_meas, RE.to_meas, RE.road_length, RE.event_length, RE.used_length, RE.leftover_length
FROM
(
SELECT RE.C_road_id[road_id], RE.C_event_id[event_id], RE.C_year[year], RE.C_from_meas[from_meas], RE.C_to_meas[to_meas], RE.C_road_length[road_length],
RE.event_length, RE.used_length, (RE.event_length - (CASE WHEN RE.HasOverlap = 1 THEN RE.used_length ELSE 0 END))[leftover_length]
FROM
(
SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
(CASE WHEN MAX(RE.A_event_id) IS NOT NULL THEN 1 ELSE 0 END)[HasOverlap],
(RE.C_to_meas - RE.C_from_meas)[event_length],
SUM( (CASE WHEN RE.O_to_meas <= RE.C_to_meas THEN RE.O_to_meas ELSE RE.C_to_meas END)
- (CASE WHEN RE.O_from_meas >= RE.C_from_meas THEN RE.O_from_meas ELSE RE.C_from_meas END)
)[used_length]--This is the length that is already being counted towards later years.
FROM
(
SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
RE.A_event_id, MIN(RE.O_from_meas)[O_from_meas], MAX(RE.O_to_meas)[O_to_meas]
FROM
(
SELECT RE_C.road_id[C_road_id], RE_C.event_id[C_event_id], RE_C.year[C_year], RE_C.from_meas[C_from_meas], RE_C.to_meas[C_to_meas], RE_C.total_road_length[C_road_length],
RE_A.road_id[A_road_id], RE_A.event_id[A_event_id], RE_A.year[A_year], RE_A.from_meas[A_from_meas], RE_A.to_meas[A_to_meas], RE_A.total_road_length[A_road_length],
RE_O.road_id[O_road_id], RE_O.event_id[O_event_id], RE_O.year[O_year], RE_O.from_meas[O_from_meas], RE_O.to_meas[O_to_meas], RE_O.total_road_length[O_road_length],
(ROW_NUMBER() OVER (PARTITION BY RE_C.road_id, RE_C.event_id, RE_O.event_id ORDER BY RE_S.Overlap DESC, RE_A.event_id))[RowNum]--Use to Group Overlaps into Swaths.
FROM road_events as RE_C--Current.
LEFT JOIN road_events as RE_A--After. --Use a Left-Join to capture when there is only 1 Event (or it is the Last-Event in the list).
ON RE_A.road_id = RE_C.road_id
AND RE_A.event_id != RE_C.event_id--Not the same EventID.
AND RE_A.year >= RE_C.year--Occured on or After the Current Event.
AND ( (RE_A.from_meas >= RE_C.from_meas AND RE_A.from_meas <= RE_C.to_meas)--There is Overlap.
OR (RE_A.to_meas >= RE_C.from_meas AND RE_A.to_meas <= RE_C.to_meas)--There is Overlap.
OR (RE_A.to_meas = RE_C.to_meas AND RE_A.from_meas = RE_C.from_meas)--They are Equal.
)
LEFT JOIN road_events as RE_O--Overlapped/Linked.
ON RE_O.road_id = RE_C.road_id
AND RE_O.event_id != RE_C.event_id--Not the same EventID.
AND RE_O.year >= RE_C.year--Occured on or After the Current Event.
AND ( (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas >= RE_A.from_meas AND RE_O.to_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas = RE_A.to_meas AND RE_O.from_meas = RE_A.from_meas)--They are Equal.
)
OUTER APPLY
(
SELECT COUNT(*)[Overlap]
FROM road_events as RE_O--Overlapped/Linked.
WHERE RE_O.road_id = RE_C.road_id
AND RE_O.event_id != RE_C.event_id--Not the same EventID.
AND RE_O.year >= RE_C.year--Occured on or After the Current Event.
AND ( (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas >= RE_A.from_meas AND RE_O.to_meas <= RE_A.to_meas)--There is Overlap.
OR (RE_O.to_meas = RE_A.to_meas AND RE_O.from_meas = RE_A.from_meas)--They are Equal.
)
) AS RE_S--Swath of Overlaps.
) AS RE
WHERE RE.RowNum = 1--Remove Duplicates and Select those that are in the biggest Swaths.
GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
RE.A_event_id
) AS RE
GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length
) AS RE
) AS RE
WHERE RE.leftover_length > 0--Filter out Events that had their entire Segments overlapped by a Later Event(s).
ORDER BY RE.road_id, RE.year DESC, RE.event_id
SQL Fiddle:
http://sqlfiddle.com/#!18/2880b/1
已添加Rules/Assumptions/Clarifications:
1.) 允许 event_id
和 road_id
可能是 Guid 的或创建的 out-of-order,
所以不要假设更高或更低的值赋予记录关系意义。
例如:
ID 为 1 和 ID 为 2 不保证 ID 2 是最新的(并且 vice-versa)。
这样解决方案将更通用,更少 "hacky".
2.) 过滤掉其整个片段被后续事件重叠的事件。
例如:
如果 2008 年的工作时间为 20-50,而 2009 年的工作时间为 10-60,
那么 2008 年的事件将被过滤掉,因为它的整个段在 2009 年被重新散列。
附加测试数据:
为确保解决方案不仅仅针对给定的数据集,
我在原来的DataSet中加了一个6的road_id
,为了多打几个fringe-cases.
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (16,6,2012,0,100,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (17,6,2013,68,69,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (18,6,2014,65,66,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (19,6,2015,62,63,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (20,6,2016,50,60,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (21,6,2017,30,40,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (22,6,2017,20,55,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (23,6,2018,0,25,100);
结果:(我在 Green[=112] 中添加的 8 附加记录=])
数据库版本:
此解决方案是 Oracle 并且 SQL-Server 不可知论者:
它应该在 SS2008+ 和 Oracle 12c+.
这个问题被标记为Oracle 12c,但是没有online-fiddle我可以不注册就用,
所以我在 SQL-Server 中对其进行了测试 - 但相同的语法应该适用于两者。
我的大部分查询都依赖 Cross-Apply
和 Outer-Apply
。
Oracle 在 12c 中引入了这些 "Joins":
https://oracle-base.com/articles/12c/lateral-inline-views-cross-apply-and-outer-apply-joins-12cr1
简化且高效:
这使用:
• 无相关子查询。
• 无递归。
• 无 CTE。
• 没有工会。
• 无用户函数。
索引:
我读了你问过的关于索引的评论之一。
我会为您要搜索和分组的每个主要字段添加 1 列索引:
road_id
、event_id
和 year
。
你可以看看这个索引是否对你有帮助(我不知道你打算如何使用这些数据):
关键字段: road_id
、event_id
、year
包括:from_meas
、to_meas
标题:
您可能需要考虑将此问题的标题重命名为更易于搜索的内容,例如:
“聚合重叠段以测量有效长度”。
这将使解决方案更容易找到,以帮助其他有类似问题的人。
其他想法:
像这样的东西在计算 Overall-Time 花费在某事
上会很有用
具有重叠的开始和停止时间戳。