Is there a formal definition of Gaps and Islands problems? If so, does this problem satisfy it?

The term "gaps and islands" seems to be overused at my workplace. I recently ran into essentially the following problem under that banner.

Take a set of data with many rows, each containing lots of data, but in particular, always including a start and stop time column and including many other columns where if one is not NULL then the others are. For example:

Start Time   Stop Time   Drunkenness   Programming Ability
01           60          0100          NULL
10           20          NULL          0450
40           50          NULL          0250

(you may also use the obvious unpivoted equivalent, but don't worry about that)

and convert that data into a form where everything is collapsed so that you can find out what is true at any given time by looking only at the single row that corresponds to that time period. So, for the previous example, you want this:

Start Time   Stop Time   Drunkenness   Programming Ability
01           09          0100          NULL
10           20          0100          0450
21           39          0100          NULL
40           50          0100          0250
51           60          0100          NULL

To see that this is what you really want, look at the times in the original rows. Until time 10, only "Drunkenness = 0100" is given, so the first row in the result must span from 01 to 09 and contain only Drunkenness info. The next row in the original table spans from 10 to 20, so we must have a row for that time period in the result, and it must contain any information that is true at that time (i.e. the "Drunkenness = 0100" that is always true and the "Programming Ability = 0450" that is true only between times 10 and 20). As "Programming Ability" is left undefined from time 21 to 39, we must have yet another row where that is NULL. The other two rows are then generated by the same process as the previous rows, so we get the table above.
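To make the process above concrete (as an illustration, not as a proposed SQL solution), here is a minimal Python sketch. It assumes integer, inclusive start and stop times as in the example, and that at most one active row supplies each attribute at any time. The function name `collapse` and the tuple layout are made up for this sketch.

```python
def collapse(rows):
    """Split the timeline wherever a row starts or ends, then merge the
    attributes of all rows that cover each resulting segment.

    Each row is (start, stop, drunkenness, programming_ability), with
    inclusive integer times and None standing in for NULL.
    """
    # The set of active rows can only change at a start, or just after a stop.
    cuts = sorted({r[0] for r in rows} | {r[1] + 1 for r in rows})
    result = []
    for seg_start, next_start in zip(cuts, cuts[1:]):
        seg_stop = next_start - 1
        active = [r for r in rows if r[0] <= seg_start and seg_stop <= r[1]]
        if not active:
            continue  # no row covers this segment at all
        # Take the first non-NULL value for each attribute, if any.
        drunkenness = next((r[2] for r in active if r[2] is not None), None)
        programming = next((r[3] for r in active if r[3] is not None), None)
        result.append((seg_start, seg_stop, drunkenness, programming))
    return result


rows = [(1, 60, 100, None), (10, 20, None, 450), (40, 50, None, 250)]
for row in collapse(rows):
    print(row)
# (1, 9, 100, None)
# (10, 20, 100, 450)
# (21, 39, 100, None)
# (40, 50, 100, 250)
# (51, 60, 100, None)
```

The output matches the five rows of the desired table.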

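Is this really a "gaps and islands" problem? Or does the literature give it a different name? I agree that there are gaps in the first data set and that the final data set is split into islands, but this does not seem to be what the literature refers to when it talks about "gaps and islands" problems. The literature seems more concerned with finding gaps or finding islands than with turning gaps into islands and merging data like this.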

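The SQL tag is used because this concerns a relational database. I am not asking for a solution, and I doubt that including SQL solutions in your answers would be enlightening, although they would be welcome. For that reason, I have not included any SQL code in this question.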

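I do not think this question is opinion-based. I have seen enough written about gaps and islands problems to believe that they must be formally defined somewhere. Answers that give a formal definition of these problems, with a source, are strongly encouraged. If this is not a gaps and islands problem but something else, then please provide its name and a sourced definition.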

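The condition "if one is not NULL the others are" means that your rows are just a different representation of key-value pairs. In other words, the unpivoted variant looks like this: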

Key                   Value   Start   End
Drunkenness           100     01      60
Programming Ability   450     10      20
Programming Ability   250     40      50

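Assume it passes the data integrity check, i.e. there are no overlapping intervals with different values for the same key. Then it looks like a type-2 slowly changing dimension, and we can in fact interpret the missing value of Programming Ability between 20 and 40 (exclusive) as NULL.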
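The integrity check mentioned above could be sketched as follows. This is only an illustration in Python, assuming inclusive integer intervals and rows laid out as (key, value, start, end) like the unpivoted table; the function name `conflicts` is made up for this sketch.

```python
from itertools import combinations


def conflicts(rows):
    """Return pairs of rows that share a key, carry different values,
    and have overlapping (inclusive) intervals."""
    bad = []
    for a, b in combinations(rows, 2):
        same_key = a[0] == b[0]
        different_value = a[1] != b[1]
        # Two inclusive intervals overlap when each starts before the other ends.
        overlap = a[2] <= b[3] and b[2] <= a[3]
        if same_key and different_value and overlap:
            bad.append((a, b))
    return bad


rows = [
    ("Drunkenness", 100, 1, 60),
    ("Programming Ability", 450, 10, 20),
    ("Programming Ability", 250, 40, 50),
]
print(conflicts(rows))  # [] : the example data passes the check
```

The pairwise comparison is quadratic in the number of rows, which is fine for a sketch but not for large tables.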

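However, that data can also be interpreted as two separate tables, Drunkenness and Programming Ability, merged (via a full join) on their start and end date intervals.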

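-- a holds the Drunkenness intervals, b the Programming Ability intervals
SELECT coalesce(a.start, b.start) AS start,
       coalesce(a.end, b.end) AS end,
       a.Value AS drunkenness, b.Value AS programming_ability
FROM a FULL JOIN b ON a.start = b.start AND a.end = b.end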

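So, for example, b lacks data for (10,60), and you have NULL for Programming Ability in the first row there. If you join the two tables properly, taking the overlap of the time intervals into account, you can obtain the second table.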

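-- two intervals overlap when each one starts before the other ends
SELECT greatest(a.start, b.start) AS start,
       least(a.end, b.end) AS end,
       a.Value AS drunkenness, b.Value AS programming_ability
FROM a FULL JOIN b ON a.start <= b.end AND b.start <= a.end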

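Either way, this is not a gaps and islands problem. In that problem, the data consists of intervals that may overlap and may have gaps between them, and the task is to identify the non-overlapping contiguous intervals (the islands) that are separated by the gaps.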
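For contrast, here is a minimal Python sketch of that classic task. It assumes inclusive integer intervals and treats intervals that touch (one ends at n, the next starts at n + 1) as part of the same island; the function names `islands` and `gaps` are made up for this sketch.

```python
def islands(intervals):
    """Merge overlapping or adjacent inclusive intervals into islands."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the current island: extend it.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(island) for island in merged]


def gaps(intervals):
    """Return the uncovered stretches between consecutive islands."""
    isl = islands(intervals)
    return [(a_stop + 1, b_start - 1)
            for (_, a_stop), (b_start, _) in zip(isl, isl[1:])]


data = [(1, 5), (3, 8), (12, 15), (16, 20), (30, 31)]
print(islands(data))  # [(1, 8), (12, 20), (30, 31)]
print(gaps(data))     # [(9, 11), (21, 29)]

# The three rows from the question form one island with no gaps.
question = [(1, 60), (10, 20), (40, 50)]
print(islands(question))  # [(1, 60)]
print(gaps(question))     # []
```

Applied to the question's intervals, this yields a single island and no gaps, which is one way to see that splitting the timeline into five rows is a different operation.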