Efficiently query nodes and edges using SQL

Question

I have two SQL Server tables that store network information. The EF model is:

public partial class edge
{
    public long edge_id { get; set; }
    public string source { get; set; }
    public string target { get; set; }
    public Nullable<System.DateTime> edgedate { get; set; }
}

public partial class node
{
    public string node_id { get; set; }
    public string name { get; set; }
    public string address { get; set; }
}

I am passing edge-specific and node-specific filters from the UI, which are built into SQL queries like these:

select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
   or address = '123 Fake Street'  

select * 
from [dbo].[Edges] 
where edgedate >= '01/12/2020 00:00:00' 
  and edgedate <= '01/12/2021 23:59:59'

However, these queries must take the whole network into account, i.e. the node filters must also be applied to the edges and vice versa:

-- nodes example with edge filters applied
select * 
from [dbo].[Nodes] 
where name = 'John Doe' 
   or address = '123 Fake Street'  
   and node_id in (select source 
                   from EDGESTEMP 
                   where edgedate >= '01/12/2020 00:00:00' 
                     and edgedate <= '01/12/2021 23:59:59'
                   union 
                   select target 
                   from EDGESTEMP 
                   where edgedate >= '01/12/2020 00:00:00' 
                     and edgedate <= '01/12/2021 23:59:59')

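This works fine on small networks, but on a network with 1 million edges and 500k nodes the performance of these queries takes a hit, because of the in statement that checks the other table in each case.

I have already added indexes on all the columns used in the queries, but I would like to know whether there is a more efficient way to do this.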

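Additional information

Query plan - here

A clustered index is set on each primary key, i.e. node_id and edge_id, and non-clustered indexes are set on the other columns, e.g.: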

CREATE NONCLUSTERED INDEX [NonClusteredIndex-20211017-194859] ON [dbo].[NODES]
(
    [name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

Answer 1

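You can use exists instead of the union for a more efficient query.

Also, you should not use ambiguous date literals. I cannot tell whether those dates mean January 12 or December 1. In addition, for datetime range queries you should not use >= and <=; use >= and < instead (a half-open range). You can see these adjustments in the code: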

select * 
from [dbo].[Nodes] n
where (name = 'John Doe' 
   or address = '123 Fake Street')  
   and exists (select * 
               from EDGESTEMP e
                   where (n.node_id = e.source or n.node_id = e.target)
                         and e.edgedate >= '20200112' 
                         and e.edgedate < '20210113');

By the way, I am assuming you already have indexes on source, target and edgedate. If not, create them.
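If those indexes are missing, they could be created along the following lines. This is only a sketch: the index names are made up for illustration, and the table is assumed to be dbo.EDGESTEMP as in the query above. Whether single-column or composite indexes serve best depends on the actual query plan.

```sql
-- Hypothetical index names; adjust schema and table name to your database.
-- One index per column used by the EXISTS subquery.
CREATE NONCLUSTERED INDEX IX_EDGESTEMP_source
    ON dbo.EDGESTEMP (source);

CREATE NONCLUSTERED INDEX IX_EDGESTEMP_target
    ON dbo.EDGESTEMP (target);

CREATE NONCLUSTERED INDEX IX_EDGESTEMP_edgedate
    ON dbo.EDGESTEMP (edgedate);
```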

Answer 2

Firstly, your query appears to have a logic error: there should be parentheses around the two or conditions.
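The reason is operator precedence: AND binds more tightly than OR, so without parentheses the edge filter only applies to the address condition. A small self-contained illustration, using a made-up table variable @t with placeholder columns a, b and c:

```sql
-- @t and its columns are hypothetical, used only to show precedence.
DECLARE @t TABLE (id int, a int, b int, c int);

INSERT INTO @t (id, a, b, c)
VALUES (1, 1, 0, 0),
       (2, 0, 1, 1),
       (3, 0, 1, 0);

-- Parsed as: a = 1 OR (b = 1 AND c = 1)
-- Returns ids 1 and 2: row 1 slips through although c <> 1.
SELECT id FROM @t WHERE a = 1 OR b = 1 AND c = 1;

-- With explicit parentheses the c filter applies to both branches.
-- Returns id 2 only.
SELECT id FROM @t WHERE (a = 1 OR b = 1) AND c = 1;
```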

Secondly, UNION ALL usually performs better than UNION, although in a semi-join such as IN or EXISTS it usually makes no difference.

select n.* 
from [dbo].[Nodes] n
where (n.name = 'John Doe' 
       or n.address = '123 Fake Street')
 and node_id in (
    select source
    from EDGES 
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
    union all
    select target 
    from EDGES
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
);

Finally, for this query you should probably have the following indexes:

NODES (name) INCLUDE (node_id)

NODES (address) INCLUDE (node_id)

EDGES (edgedate) INCLUDE (source, target)
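Written out as T-SQL, the three index definitions above might look like this. The index names are hypothetical, and the tables are assumed to live in the dbo schema as in the question.

```sql
-- Hypothetical index names; key and included columns follow the list above.
CREATE NONCLUSTERED INDEX IX_NODES_name
    ON dbo.NODES (name)
    INCLUDE (node_id);

CREATE NONCLUSTERED INDEX IX_NODES_address
    ON dbo.NODES (address)
    INCLUDE (node_id);

-- Covers the date-range subquery: seek on edgedate, read source and target
-- from the index without touching the base table.
CREATE NONCLUSTERED INDEX IX_EDGES_edgedate
    ON dbo.EDGES (edgedate)
    INCLUDE (source, target);
```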

The or condition may still cause problems, because you may still get an index scan on NODES. If so, you may need to rewrite the query to force an index union:

select n.* 
from (
    select *
      from [dbo].[Nodes] n
      where n.name = 'John Doe'
    union
    select *
      from [dbo].[Nodes] n
      where n.address = '123 Fake Street'
) n
where node_id in (
    select source
    from EDGES 
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
    union all
    select target 
    from EDGES
    where edgedate >= '01/12/2020 00:00:00' 
      and edgedate <= '01/12/2021 23:59:59'
);

sql-server

edges

nodes