SQL WHERE IN () Performance Optimization

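I checked several possible duplicate questions but could not find one that matches. I am working with three tables: the first is "articles", the second is "tags", and the third is "article_tags", which contains two foreign keys, "articleid" and "tagid". The "article_tags" table links together articles that share the same tags.

My SQL queries are as follows: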

'Getting 4 tag ids from table article_tags
MyCommand = New SqlCommand("SELECT TOP (4) at.tagid FROM articles a, article_tags at WHERE at.articleid=a.id AND at.articleid=@id", myconnection)
MyCommand.Parameters.Add("@id", SqlDbType.Int).Value = intID
Dim daQuery = New SqlDataAdapter(MyCommand)
Dim dtArticleTags = New DataTable
daQuery.Fill(dtArticleTags)
Dim strTagIds As String = ""
If dtArticleTags.Rows.Count > 0 Then
    'Save all tags id's in a string "strTags"
    '--------------------------------------------- 
    Dim i As Integer = 0
    For Each myrow As DataRow In dtArticleTags.Rows
        'Store the Tag Ids in a string
        strTagIds += myrow.Item("tagid").ToString
        If i <> dtArticleTags.Rows.Count - 1 Then
            strTagIds += ","
        End If
        i += 1
    Next
    '--------------------------------------------- 
    'Getting 5 related articles sharing the same tags (Note: I know that strTagIds is not parametrized but this can never be inputted by a user)
    MyCommand = New SqlCommand("SELECT TOP(5) at.articleid FROM article_tags at, articles a WHERE a.id=at.articleid AND a.publish_flag=1 AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date) AND at.tagid IN (" & strTagIds & ") AND at.articleid<>@id Group by at.articleid, a.publish_date ORDER BY a.publish_date DESC", myconnection)
    MyCommand.Parameters.Add("@id", SqlDbType.Int).Value = intID
    daRelated = New SqlDataAdapter(MyCommand)
    daRelated.Fill(dsArticle, "related")
End If

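I feel that the queries above take a while to load, especially as the "article_tags" table grows. I am using SQL Express Edition and the tables are indexed. I would like to know whether there is a better way to improve performance.

The execution plan is attached:

Running the following query, suggested by @SuperPoney, returns the same results. Its execution plan is below: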

WITH top_art AS
(
    SELECT TOP (4) at.tagid
    FROM articles a, article_tags at
    WHERE at.articleid=a.id
    AND at.articleid=@id
)
SELECT TOP(5) at.articleid
FROM article_tags at, articles a, top_art
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid=top_art.tagid
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC

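My first suggestion is to "let the database do the work". Can you compare the query plans for these two queries in your database?

Maybe that will answer your question.

Query #1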

SELECT TOP(5) at.articleid
FROM article_tags at, articles a
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid IN (
    SELECT TOP (4) at.tagid
    FROM articles a, article_tags at
    WHERE at.articleid=a.id
    AND at.articleid=@id
)
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC

Query #2

WITH top_art AS
(
    SELECT TOP (4) at.tagid
    FROM articles a, article_tags at
    WHERE at.articleid=a.id
    AND at.articleid=@id
)
SELECT TOP(5) at.articleid
FROM article_tags at, articles a, top_art
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid=top_art.tagid
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC

You can get everything you need in a single query:

SELECT  TOP (5) a.ID
FROM    articles AS a
WHERE   a.publish_flag = 1 
AND     a.publish_date <  DATEADD(mi, -DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME())
AND     a.Id <> @ID 
AND     EXISTS 
        (   SELECT  1
            FROM    article_tags AS at
            WHERE   at.ArticleID = a.ID
            AND     EXISTS  
                    (   SELECT  1 
                        FROM    article_tags AS at2
                        WHERE   at2.ArticleID = @ID
                        AND     at2.TagID = at.TagID
                    )
        )
ORDER BY a.publish_date DESC;
                    

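I have assumed that you were originally using TOP 4 tags only as an arbitrary limit for performance reasons, since there is no ordering, so I have left it out. I have also changed your predicate from:

SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)

to (with the offset negated, because it moves to the other side of the comparison):

a.publish_date <  DATEADD(mi, -DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME())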

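This means the same thing, but because the DATEADD/DATEDIFF functions are now called only on the runtime constants SYSDATETIME() and GETUTCDATE(), the calculation is done once rather than once for every a.publish_date. It also means that any index on publish_date is now usable.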
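As a rough sketch of why this matters, the right-hand side can be computed once into a variable, leaving the bare column on the left. The variable name @cutoff is my own and is not something from the question. Starting from "publish_date + offset < now", rearranging gives "publish_date < now - offset", which is why the offset is negated here:

```sql
-- Sketch only: @cutoff is a hypothetical name used for illustration.
-- The offset is the difference between local time and UTC, in minutes.
DECLARE @cutoff datetime2 =
    DATEADD(mi, -DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME());

-- Original form: the function wraps the column, so it is evaluated per row
-- and an index on publish_date cannot be used for a seek.
--   WHERE SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)

-- Rearranged form: the column stands alone and is compared to a single value.
SELECT  a.id
FROM    articles AS a
WHERE   a.publish_flag = 1
AND     a.publish_date < @cutoff;
```

Whether the optimizer actually chooses an index seek still depends on the indexes that exist and on the data, so it is worth confirming against the execution plan.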

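The other change I made was to use EXISTS rather than a JOIN to link the article tags. This avoids duplicates, although it is equally trivial to remove duplicates with GROUP BY, for example:
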
SELECT  TOP (5) a.ID
FROM    articles AS a
        INNER JOIN article_tags AS at
            ON at.ArticleID = a.ID
WHERE   a.publish_flag = 1 
AND     a.publish_date <  DATEADD(mi, -DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), SYSDATETIME())
AND     a.Id <> @ID 
AND     EXISTS  
        (   SELECT  1 
            FROM    article_tags AS at2
            WHERE   at2.ArticleID = @ID
            AND     at2.TagID = at.TagID
        )
GROUP BY a.ID, a.publish_date
ORDER BY a.publish_date DESC;
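The question says the tables are indexed but does not say how, so the following is only a guess at indexes that might support the queries above. The index names are hypothetical, and some of these may already exist (for example as the primary key of article_tags):

```sql
-- Assumption: none of these indexes exist yet; names are made up for illustration.

-- Helps find the tags of the article identified by @ID.
CREATE INDEX IX_article_tags_articleid_tagid
    ON article_tags (articleid, tagid);

-- Helps find other articles that carry a given tag.
CREATE INDEX IX_article_tags_tagid_articleid
    ON article_tags (tagid, articleid);

-- Helps the publish_date range predicate and ORDER BY publish_date DESC.
CREATE INDEX IX_articles_publish_date
    ON articles (publish_date DESC)
    INCLUDE (publish_flag);
```

Compare the execution plans before and after adding any of these, since an index that is not used only slows down writes.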

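A few side notes that are not directly related to the answer above, but are still worth mentioning:

  1. The implicit join syntax you are using was replaced by the ANSI-92 explicit join syntax 28 years ago. There are plenty of good reasons to switch to the "newer" syntax, so I would advise that you do.
  2. Parameterised queries are about more than just preventing SQL injection attacks (they also give you type safety and query plan caching, among other things), so the fact that your input does not come from a user does not mean you should skip parameterised queries.
  3. I would strongly advise against re-using your SqlClient objects (SqlConnection, SqlCommand). Create a new object for each use, and dispose of it properly when you are done.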
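To illustrate the first side note, here is the tag lookup from the question written both ways. The value assigned to @id is a placeholder so that the snippet runs on its own:

```sql
-- Placeholder value for illustration; in the application this is a command parameter.
DECLARE @id int = 1;

-- Implicit (comma) join, as in the question: the join condition is mixed
-- in with the filters in the WHERE clause.
SELECT  TOP (4) at.tagid
FROM    articles a, article_tags at
WHERE   at.articleid = a.id
AND     at.articleid = @id;

-- Equivalent explicit ANSI-92 join: the join condition lives in ON,
-- and WHERE holds only the filter.
SELECT  TOP (4) at.tagid
FROM    articles AS a
        INNER JOIN article_tags AS at
            ON at.articleid = a.id
WHERE   at.articleid = @id;
```

Both forms should produce the same plan. The benefit is readability, plus the fact that a forgotten join condition becomes a syntax error instead of a silent cross join.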