Efficiently query nodes and edges using SQL
I have 2 SQL Server tables storing network information. The EF schema is:
public partial class edge
{
public long edge_id { get; set; }
public string source { get; set; }
public string target { get; set; }
public Nullable<System.DateTime> edgedate { get; set; }
}
public partial class node
{
public string node_id { get; set; }
public string name { get; set; }
public string address { get; set; }
}
I am passing edge-specific and node-specific filters from the UI, which are built into SQL queries like these:
select *
from [dbo].[Nodes]
where name = 'John Doe'
or address = '123 Fake Street'
select *
from [dbo].[Edges]
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
However, these queries must take the whole network into account, i.e. the node filters must be applied to the edges and vice versa:
-- nodes example with edge filters applied
select *
from [dbo].[Nodes]
where name = 'John Doe'
or address = '123 Fake Street'
and node_id in (select source
from EDGESTEMP
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
union
select target
from EDGESTEMP
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59')
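This works fine on small networks, but on a network with 1 million edges and 500k nodes the performance of these queries suffers because of the in statements that check the other table in each instance.
I have already added indexes on all the ancillary columns used in the queries, but I would like to know whether there is a more efficient way of doing this.
Additional info
Query plan - here
Clustered indexes are set on each primary key, i.e. node_id and edge_id, and non-clustered indexes are set on the other columns, e.g.: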
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20211017-194859] ON [dbo].[NODES]
(
[name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO
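You can use exists instead of the union for a more efficient query.
Also, you shouldn't use ambiguous date literals. I can't tell whether those dates are January 12 or December 1. In addition, for datetime range queries you shouldn't use >= and <=, but >= and <. You can see these adjustments in the code: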
select *
from [dbo].[Nodes] n
where (name = 'John Doe'
or address = '123 Fake Street')
and exists (select *
from EDGESTEMP e
where (n.node_id = e.source or n.node_id = e.target)
and e.edgedate >= '20200112'
and e.edgedate < '20210113');
By the way, I assume you already have indexes on source, target and edgedate. If not, create them.
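To illustrate why the half-open range (>= and <) is safer than an inclusive end of 23:59:59, here is a minimal sketch. The table variable and the sample row are made up for illustration, and the dates are read as January 12, as in the query above.

```sql
-- Hypothetical sample data: one edge stamped in the last second of the day.
DECLARE @Edges TABLE (edgedate datetime);
INSERT INTO @Edges VALUES ('20210112 23:59:59.500');

-- Inclusive upper bound: the literal converts to 23:59:59.000,
-- so the row at 23:59:59.500 falls outside the range.
SELECT COUNT(*) AS inclusive_match
FROM @Edges
WHERE edgedate >= '20200112'
  AND edgedate <= '20210112 23:59:59';   -- returns 0

-- Half-open upper bound: everything before midnight of the next day matches.
SELECT COUNT(*) AS half_open_match
FROM @Edges
WHERE edgedate >= '20200112'
  AND edgedate < '20210113';             -- returns 1
```

The unseparated yyyymmdd form is also interpreted the same way regardless of the session's language or DATEFORMAT setting, which removes the day/month ambiguity.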
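Firstly, your query appears to have a logic error: there should be parentheses around the two sides of the or.
Secondly, UNION ALL usually performs better than UNION, although in a semi-join such as IN or EXISTS it often doesn't matter.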
select n.*
from [dbo].[Nodes] n
where (n.name = 'John Doe'
or n.address = '123 Fake Street')
and node_id in (
select source
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
union all
select target
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
);
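The effect of the missing parentheses can be seen in a small sketch. AND binds more tightly than OR, so without parentheses the edge filter only applies to the address branch. The table variables and rows below are invented for illustration; @EdgeNodes stands in for the result of the edge subquery.

```sql
-- Hypothetical sample data, not from the original tables.
DECLARE @Nodes TABLE (node_id varchar(10), name varchar(50), address varchar(50));
DECLARE @EdgeNodes TABLE (node_id varchar(10));

INSERT INTO @Nodes VALUES
    ('n1', 'John Doe', '1 Other Road'),      -- matches on name, has no edge
    ('n2', 'Jane Roe', '123 Fake Street');   -- matches on address, has an edge
INSERT INTO @EdgeNodes VALUES ('n2');

-- Without parentheses this is parsed as:
--   name = ... OR (address = ... AND node_id IN (...))
SELECT node_id
FROM @Nodes
WHERE name = 'John Doe'
   OR address = '123 Fake Street'
  AND node_id IN (SELECT node_id FROM @EdgeNodes);
-- returns n1 and n2: n1 slips past the edge filter

-- With parentheses the edge filter applies to both branches.
SELECT node_id
FROM @Nodes
WHERE (name = 'John Doe' OR address = '123 Fake Street')
  AND node_id IN (SELECT node_id FROM @EdgeNodes);
-- returns n2 only
```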
Finally, for this query you should probably have the following indexes:
NODES (name) INCLUDE (node_id)
NODES (address) INCLUDE (node_id)
EDGES (edgedate) INCLUDE (source, target)
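Written out as DDL, the shorthand above could look roughly like the following. This is only a sketch: the index names are made up, the tables are assumed to be dbo.Nodes and dbo.Edges, and the key columns are assumed to have types that are allowed in an index key (for example not nvarchar(max)).

```sql
-- Hypothetical index names; adjust table names to match the actual schema.
CREATE NONCLUSTERED INDEX IX_Nodes_name
    ON dbo.Nodes (name)
    INCLUDE (node_id);

CREATE NONCLUSTERED INDEX IX_Nodes_address
    ON dbo.Nodes (address)
    INCLUDE (node_id);

-- Covers the date filter and returns source/target without a key lookup.
CREATE NONCLUSTERED INDEX IX_Edges_edgedate
    ON dbo.Edges (edgedate)
    INCLUDE (source, target);
```

Since node_id is the clustered key, SQL Server stores it in every nonclustered index anyway, so listing it under INCLUDE mainly documents the intent. The existing nonclustered index on name from the question may already serve the first of these.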
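The or condition may still cause problems, as you are likely to still get an index scan on NODES. If so, you may need to rewrite the query to force an index union.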
select n.*
from (
select *
from [dbo].[Nodes] n
where n.name = 'John Doe'
union
select *
from [dbo].[Nodes] n
where n.address = '123 Fake Street'
) n
where node_id in (
select source
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
union all
select target
from EDGES
where edgedate >= '01/12/2020 00:00:00'
and edgedate <= '01/12/2021 23:59:59'
);