SQL WHERE IN () Performance Optimization
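I checked several possible duplicate questions but could not find one that matches. I am working with three tables: the first is "articles", the second is "tags", and the third is "article_tags", which contains two foreign keys, "articleid" and "tagid". The "article_tags" table links articles that share the same tags.
My SQL query is as follows: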
'Getting 4 tag ids from table article_tags
MyCommand = New SqlCommand("SELECT TOP (4) at.tagid FROM articles a, article_tags at WHERE at.articleid=a.id AND at.articleid=@id", myconnection)
MyCommand.Parameters.Add("@id", SqlDbType.Int).Value = intID
Dim daQuery = New SqlDataAdapter(MyCommand)
Dim dtArticleTags = New DataTable
daQuery.Fill(dtArticleTags)
Dim strTagIds As String = ""
If dtArticleTags.Rows.Count > 0 Then
'Save all tags id's in a string "strTags"
'---------------------------------------------
Dim i As Integer = 0
For Each myrow As DataRow In dtArticleTags.Rows
'Store the Tag Ids in a string
strTagIds += myrow.Item("tagid").ToString
If i <> dtArticleTags.Rows.Count - 1 Then
strTagIds += ","
End If
i += 1
Next
'---------------------------------------------
'Getting 5 related articles sharing the same tags (Note: I know that strTagIds is not parametrized but this can never be inputted by a user)
MyCommand = New SqlCommand("SELECT TOP(5) at.articleid FROM article_tags at, articles a WHERE a.id=at.articleid AND a.publish_flag=1 AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date) AND at.tagid IN (" & strTagIds & ") AND at.articleid<>@id Group by at.articleid, a.publish_date ORDER BY a.publish_date DESC", myconnection)
MyCommand.Parameters.Add("@id", SqlDbType.Int).Value = intID
daRelated = New SqlDataAdapter(MyCommand)
daRelated.Fill(dsArticle, "related")
End If
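I feel that the queries above take time to load, especially as the "article_tags" table grows. I am using SQL Express Edition and the tables are indexed; I would like to know whether there is a better way to improve performance.
The execution plan is attached:
Running the following query suggested by @SuperPoney returns the same results; the execution plan is below: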
WITH top_art AS
(
SELECT TOP (4) at.tagid
FROM articles a, article_tags at
WHERE at.articleid=a.id
AND at.articleid=@id
)
SELECT TOP(5) at.articleid
FROM article_tags at, articles a, top_art
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid=top_art.tagid
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC
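My first suggestion is to "let the database do the work".
Can you get the query plans for these two queries in your database?
Maybe that will answer your question.
Query #1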
SELECT TOP(5) at.articleid
FROM article_tags at, articles a
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid IN (
SELECT TOP (4) at.tagid
FROM articles a, article_tags at
WHERE at.articleid=a.id
AND at.articleid=@id
)
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC
Query #2
WITH top_art AS
(
SELECT TOP (4) at.tagid
FROM articles a, article_tags at
WHERE at.articleid=a.id
AND at.articleid=@id
)
SELECT TOP(5) at.articleid
FROM article_tags at, articles a, top_art
WHERE a.id=at.articleid
AND a.publish_flag=1
AND SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
AND at.tagid=top_art.tagid
AND at.articleid<>@id
Group by at.articleid, a.publish_date
ORDER BY a.publish_date DESC
You can get everything you need in a single query:
SELECT TOP (5) a.ID
FROM articles AS a
WHERE a.publish_flag = 1
AND a.publish_date < DATEADD(mi, DATEDIFF(mi, SYSDATETIME(), GETUTCDATE()), SYSDATETIME())
AND a.ID <> @ID
AND EXISTS
( SELECT 1
  FROM article_tags AS at
  WHERE at.ArticleID = a.ID
  AND EXISTS
  ( SELECT 1
    FROM article_tags AS at2
    WHERE at2.ArticleID = @ID
    AND at2.TagID = at.TagID
  )
)
ORDER BY a.publish_date DESC;
I assumed that you originally used TOP 4 on the tags as an arbitrary limit for performance reasons, since there is no ORDER BY, so I have left it out. I also changed your predicate:
SYSDATETIME() > DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), SYSDATETIME()), a.publish_date)
to
a.publish_date < DATEADD(mi, DATEDIFF(mi, SYSDATETIME(), GETUTCDATE()), SYSDATETIME())
This means the same thing. Moving the offset from the column side of the comparison to the SYSDATETIME() side flips its sign, which is why the DATEDIFF arguments are swapped. Because DATEADD/DATEDIFF are now called only on the runtime constants SYSDATETIME() and GETUTCDATE(), the calculation is done once rather than once for every a.publish_date, which means any index on publish_date can now be used.
The other change I made is to use EXISTS rather than JOIN to link the article tags. This avoids duplicates, although removing duplicates with GROUP BY is just as trivial, e.g.
SELECT TOP (5) a.ID
FROM articles AS a
INNER JOIN article_tags AS at
ON at.ArticleID = a.ID
WHERE a.publish_flag = 1
AND a.publish_date < DATEADD(mi, DATEDIFF(mi, SYSDATETIME(), GETUTCDATE()), SYSDATETIME())
AND a.ID <> @ID
AND EXISTS
( SELECT 1
  FROM article_tags AS at2
  WHERE at2.ArticleID = @ID
  AND at2.TagID = at.TagID
)
GROUP BY a.ID, a.publish_date
ORDER BY a.publish_date DESC;
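The idea behind rewriting the date predicate (apply the local/UTC offset to the current time once, rather than to publish_date on every row) can be checked with plain date arithmetic. The sketch below is only an illustration: the module name, the function names, the sample timestamp, and the 120-minute offset are all made up for the example. It shows that "now > publish_date + offset" and "publish_date < now - offset" always agree, so the offset changes sign when it moves to the other side of the comparison.

```vbnet
Imports System

Module PredicateCheck

    ' Original shape: the offset is added to the column value, once per row.
    Function OriginalForm(nowLocal As DateTime, offsetMinutes As Integer, publishDate As DateTime) As Boolean
        Return nowLocal > publishDate.AddMinutes(offsetMinutes)
    End Function

    ' Rewritten shape: the offset is subtracted from "now" once, and the column is left bare.
    Function RewrittenForm(nowLocal As DateTime, offsetMinutes As Integer, publishDate As DateTime) As Boolean
        Dim cutoff As DateTime = nowLocal.AddMinutes(-offsetMinutes)
        Return publishDate < cutoff
    End Function

    Sub Main()
        ' Hypothetical values: a fixed "now" and a server running at UTC+2.
        Dim nowLocal As New DateTime(2020, 6, 1, 12, 0, 0)
        Dim offsetMinutes As Integer = 120

        For Each hoursFromNow As Integer In {-5, -3, -2, -1, 0, 1, 2, 3, 5}
            Dim publishDate As DateTime = nowLocal.AddHours(hoursFromNow)
            If OriginalForm(nowLocal, offsetMinutes, publishDate) <>
               RewrittenForm(nowLocal, offsetMinutes, publishDate) Then
                Throw New InvalidOperationException("The two forms disagree.")
            End If
        Next

        Console.WriteLine("Both forms agree for every sample.")
    End Sub

End Module
```

In the database the benefit is not the arithmetic itself but that the column appears on its own in the comparison, which is what allows an index on publish_date to be considered.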
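A few side notes that are not directly related to the answer above, but are still worth mentioning.
- The implicit join syntax you are using was replaced by ANSI 92 explicit join syntax 28 years ago. There are plenty of good reasons to switch to the "newer" syntax, so I would advise that you do.
- Parameterised queries are about more than just preventing SQL injection attacks (the benefits include, but are not limited to, type safety and query plan caching), so the fact that your input does not come from a user does not mean you should not use parameterised queries.
- I would strongly advise against re-using your SqlClient objects (SqlConnection, SqlCommand). Create a new object for each use and dispose of it properly when you are done.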
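As a rough illustration of the three notes taken together, the sketch below runs a single parameterised query with explicit JOIN syntax and wraps every SqlClient object in a Using block so that it is disposed after use. The module name, the function name GetRelatedArticles, and the connectionString argument are hypothetical, and the date cutoff assumes publish_date is compared against the server's current time minus its local/UTC offset, as in the question's original predicate. It is a sketch under those assumptions, not a drop-in replacement.

```vbnet
Imports System.Data
Imports System.Data.SqlClient

Module RelatedArticlesSketch

    ' Explicit INNER JOIN instead of a comma-separated FROM list; @ID is the only input.
    Private Const RelatedSql As String =
        "SELECT TOP (5) a.id " &
        "FROM articles AS a " &
        "INNER JOIN article_tags AS at ON at.articleid = a.id " &
        "WHERE a.publish_flag = 1 " &
        "AND a.publish_date < DATEADD(mi, DATEDIFF(mi, SYSDATETIME(), GETUTCDATE()), SYSDATETIME()) " &
        "AND a.id <> @ID " &
        "AND EXISTS (SELECT 1 FROM article_tags AS at2 " &
        "WHERE at2.articleid = @ID AND at2.tagid = at.tagid) " &
        "GROUP BY a.id, a.publish_date " &
        "ORDER BY a.publish_date DESC;"

    ' Hypothetical helper: returns up to five related article ids for the given article.
    Public Function GetRelatedArticles(connectionString As String, articleId As Integer) As DataTable
        Dim related As New DataTable("related")

        ' Each object is created for this call only and disposed when its Using block ends.
        Using connection As New SqlConnection(connectionString)
            Using command As New SqlCommand(RelatedSql, connection)
                ' Typed parameter: no string concatenation of ids into the SQL text.
                command.Parameters.Add("@ID", SqlDbType.Int).Value = articleId

                Using adapter As New SqlDataAdapter(command)
                    ' Fill opens the connection if it is closed and closes it again afterwards.
                    adapter.Fill(related)
                End Using
            End Using
        End Using

        Return related
    End Function

End Module
```

Because the tag lookup now happens inside the query, there is no longer a comma-separated list of tag ids to build in VB, so the only value passed to the database is the typed @ID parameter.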