Django ORM: How can I sort by date and then select the best of the objects grouped by a foreign key?
I know my title is a bit complicated, so please allow me to demonstrate. I'm using Python 3 with Django 2.2.5. Here are the models I'm currently working with:
from django.db import models
from django.db.models import F
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField, SearchVector, SearchQuery, SearchRank


class Thread(models.Model):
    title = models.CharField(max_length=100)
    last_update = models.DateTimeField(auto_now=True)


class PostQuerySet(models.QuerySet):
    _search_vector = SearchVector('thread__type') + \
                     SearchVector('thread__title') + \
                     SearchVector('from_name') + \
                     SearchVector('from_email') + \
                     SearchVector('message')

    ###
    # There's code here that updates the `Post.search_vector` field for each `Post` object
    # using `PostQuerySet._search_vector`.
    ###

    def search(self, text):
        """
        Search posts using the indexed `search_vector` field. I can, for example, call
        `Post.objects.search('influenza h1n1')`.
        """
        search_query = SearchQuery(text)
        search_rank = SearchRank(F('search_vector'), search_query)
        return self.annotate(rank=search_rank).filter(search_vector=search_query).order_by('-rank')


class Post(models.Model):
    thread = models.ForeignKey(Thread, on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    from_name = models.CharField(max_length=100)
    from_email = models.EmailField()
    message = models.TextField()
    in_response_to = models.ManyToManyField('self', symmetrical=False, blank=True)
    search_vector = SearchVectorField(null=True)

    objects = PostQuerySet.as_manager()

    class Meta:
        ordering = ['timestamp']
        indexes = [
            GinIndex(fields=['search_vector'])
        ]
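(For brevity, I removed some parts of these models that I don't think are relevant; if they turn out to matter, I'll add them back.)
In plain English: I'm working on an app that represents data from an email listserv. Basically, a Thread contains multiple Post objects; people reply-all to an initial post and a discussion starts. I just implemented search using Django's built-in support for PostgreSQL full-text search. It's super fast, and I love it. Here's an example of a search in my views.py: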
###
# Pull `query` from a form defined in `forms.py`.
###
search_results = Post.objects.search(query).order_by('-timestamp')
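Everything works well, and the returned search results make sense. But I've just run into a situation that I'm not sure how to handle. The displayed results aren't as clean as I'd like. This query returns all Post objects that match the user-provided query. That's fine, but many Post objects from the same Thread can clog up the results. They might look like this: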
post5 from thread2 - timestamp 2018-04-01, rank 0.5
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post3 from thread1 - timestamp 2018-01-01, rank 0.6
post2 from thread1 - timestamp 2017-12-01, rank 0.7
post2 from thread2 - timestamp 2017-11-01, rank 0.7
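(Here, rank is the relevance score returned by Django's SearchRank.)
What I really want is this: for each Thread, show the single most representative matching Post, with the results ordered by timestamp descending. In other words, for each Thread that has a Post in the search results, only the highest-rank Post should be shown, and those highest-rank Post objects should be sorted by timestamp descending. So in the example above, these are the results I'd like to see: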
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post2 from thread1 - timestamp 2017-12-01, rank 0.7
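It would be fairly simple to do what I want with a few for loops, but I'd really like a way to do this purely in the ORM for efficiency. Do you have any suggestions? If you need me to clarify anything about the setup or what I'm after, please let me know.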
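For reference, the for-loop version of the requirement might look roughly like the sketch below. It is plain Python over the sample rows from the question, not ORM code; the tuple layout and the function name best_post_per_thread are made up for illustration, and breaking rank ties by the newer timestamp is an assumption the question does not spell out.

```python
from datetime import date

# Sample rows copied from the question: (post, thread, timestamp, rank).
rows = [
    ("post5", "thread2", date(2018, 4, 1), 0.5),
    ("post1", "thread3", date(2018, 3, 1), 0.25),
    ("post3", "thread2", date(2018, 2, 1), 0.75),
    ("post3", "thread1", date(2018, 1, 1), 0.6),
    ("post2", "thread1", date(2017, 12, 1), 0.7),
    ("post2", "thread2", date(2017, 11, 1), 0.7),
]


def best_post_per_thread(rows):
    """Keep the highest-ranked row per thread, then sort newest first."""
    best = {}
    for row in rows:
        _, thread, timestamp, rank = row
        current = best.get(thread)
        # Prefer the higher rank; on a tie, prefer the newer timestamp (assumption).
        if current is None or (rank, timestamp) > (current[3], current[2]):
            best[thread] = row
    return sorted(best.values(), key=lambda r: r[2], reverse=True)


result = best_post_per_thread(rows)
for post, thread, timestamp, rank in result:
    print(f"{post} from {thread} - timestamp {timestamp}, rank {rank}")
```

This loads every matching row into Python, which is exactly the cost the question hopes to avoid by pushing the work into the database.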
I think we can use distinct to select the first row of each group of results.
Can you try something like this:
results = Thread.objects.filter(post__search_vector=search_query) \
    .annotate(rank=search_rank) \
    .order_by('id', '-rank') \
    .distinct('id')
# Then sort these **limited results** by rank manually in Python instead of by thread id.
# The performance of this should be much better than looping over all results in Python.
I can't test it because I don't have a suitable Django model setup. Please share the output of print(results.query) for the query above.
I think you have to query the Post model ordered by thread, rank, and timestamp, and then use distinct on thread.
Search
This is the search ordered by timestamp:
Post.objects.search("dolor").order_by("-timestamp")
This is the SQL executed on my local PostgreSQL:
SELECT
    "post"."from_name",
    "thread"."title",
    "post"."timestamp",
    ts_rank("post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM
    "post"
    INNER JOIN "thread" ON ("post"."thread_id" = "thread"."id")
WHERE
    "post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "post"."timestamp" DESC
These are the search results on my local data:
post1 from thread1 - timestamp 2019-07-01, rank 0.0607927
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread2 - timestamp 2019-05-01, rank 0.0607927
post3 from thread1 - timestamp 2019-05-01, rank 0.0607927
post1 from thread3 - timestamp 2019-05-01, rank 0.0607927
post3 from thread2 - timestamp 2019-04-01, rank 0.0759909
post4 from thread1 - timestamp 2019-04-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
post5 from thread1 - timestamp 2019-03-01, rank 0.0607927
post3 from thread3 - timestamp 2019-03-01, rank 0.0607927
post4 from thread2 - timestamp 2019-03-01, rank 0.0607927
post5 from thread2 - timestamp 2019-02-01, rank 0.0759909
post4 from thread3 - timestamp 2019-02-01, rank 0.0759909
post5 from thread3 - timestamp 2019-01-01, rank 0.0759909
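Solution
This query shows only the most representative matching Post (by search rank) for each thread, with ties in rank going to the most recent post. The rows come back ordered by thread, because PostgreSQL's DISTINCT ON requires the ORDER BY to start with the same expressions: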
Post.objects.search("dolor").order_by(
    "thread", "-rank", "-timestamp"
).distinct("thread")
This is the SQL executed on my local PostgreSQL:
SELECT DISTINCT ON ("forum_post"."thread_id")
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp",
    ts_rank("forum_post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "forum_post"."thread_id" ASC,
    "rank" DESC,
    "forum_post"."timestamp" DESC
These are the search results on my local data:
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
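To see what DISTINCT ON is doing without a database, here is a rough pure-Python emulation: sort by thread ascending, rank descending, timestamp descending, then keep the first row of each thread. It is only an illustration of the semantics, not what Django or PostgreSQL run internally. The rows are a subset of the local data listed above, and the tuple layout and variable names are made up for the sketch.

```python
from itertools import groupby
from operator import itemgetter

# Subset of the local data above: (post, thread_id, timestamp, rank).
rows = [
    ("post1", 1, "2019-07-01", 0.0607927),
    ("post2", 1, "2019-06-01", 0.0759909),
    ("post1", 2, "2019-06-01", 0.0759909),
    ("post2", 2, "2019-05-01", 0.0607927),
    ("post1", 3, "2019-05-01", 0.0607927),
    ("post4", 1, "2019-04-01", 0.0759909),
    ("post2", 3, "2019-04-01", 0.0759909),
]

# Python's sort is stable, so sorting by the least significant key first
# gives the same order as ORDER BY thread_id ASC, rank DESC, timestamp DESC.
ordered = sorted(rows, key=itemgetter(2), reverse=True)     # timestamp DESC
ordered = sorted(ordered, key=itemgetter(3), reverse=True)  # rank DESC
ordered = sorted(ordered, key=itemgetter(1))                # thread_id ASC

# DISTINCT ON (thread_id): keep only the first row of each thread group.
first_per_thread = [next(group) for _, group in groupby(ordered, key=itemgetter(1))]

for post, thread_id, timestamp, rank in first_per_thread:
    print(f"{post} from thread{thread_id} - timestamp {timestamp}, rank {rank}")
```

This also shows why the thread column has to lead the ordering: the "first row per group" is only well defined when rows of the same thread sit next to each other.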
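Note
You can read more about distinct in the official Django documentation.
Update
If you absolutely need the results sorted by timestamp in descending order and don't need to show the rank, you can wrap the previous query in a subquery and sort the posts afterwards: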
from django.db.models import Subquery

Post.objects.filter(
    pk__in=Subquery(
        Post.objects.search("dolor")
        .order_by("-thread", "-rank", "-timestamp")
        .distinct("thread")
        .values("id")
    )
).order_by("-timestamp")
This is the SQL executed on my local PostgreSQL:
SELECT
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."id" IN (
        SELECT DISTINCT ON (U0."thread_id")
            U0."id"
        FROM
            "forum_post" U0
        WHERE
            U0."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
        ORDER BY
            U0."thread_id" DESC,
            ts_rank(U0."search_vector", plainto_tsquery('dolor')) DESC,
            U0."timestamp" DESC)
ORDER BY
    "forum_post"."timestamp" DESC
These are the search results on my local data:
post2 from thread1 - timestamp 2019-06-01
post1 from thread2 - timestamp 2019-06-01
post2 from thread3 - timestamp 2019-04-01
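The shape of this update (pick one id per thread in an inner query, then re-sort in the outer query) can be tried without PostgreSQL. The sketch below uses Python's built-in sqlite3 module as a stand-in. SQLite has no DISTINCT ON, so a correlated subquery with LIMIT 1 is swapped in for it, and the rank is stored in a plain search_rank column instead of being computed by ts_rank. The table layout and column names are assumptions for illustration, and the rows are the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post ("
    " id INTEGER PRIMARY KEY, name TEXT, thread_id INTEGER,"
    " timestamp TEXT, search_rank REAL)"
)
conn.executemany(
    "INSERT INTO post (name, thread_id, timestamp, search_rank) VALUES (?, ?, ?, ?)",
    [
        ("post5", 2, "2018-04-01", 0.5),
        ("post1", 3, "2018-03-01", 0.25),
        ("post3", 2, "2018-02-01", 0.75),
        ("post3", 1, "2018-01-01", 0.6),
        ("post2", 1, "2017-12-01", 0.7),
        ("post2", 2, "2017-11-01", 0.7),
    ],
)

# Inner query: the best post id for the outer row's thread
# (highest rank, newest timestamp on ties).
# Outer query: keep only those posts and sort them newest first.
query = """
SELECT p.name, p.thread_id, p.timestamp
FROM post AS p
WHERE p.id = (
    SELECT c.id
    FROM post AS c
    WHERE c.thread_id = p.thread_id
    ORDER BY c.search_rank DESC, c.timestamp DESC
    LIMIT 1
)
ORDER BY p.timestamp DESC
"""
rows = conn.execute(query).fetchall()
for name, thread_id, timestamp in rows:
    print(f"{name} from thread{thread_id} - timestamp {timestamp}")
```

On the question's sample data this returns one post per thread, newest first, which is the ordering the question asked for.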