Django ORM: How can I sort by date and then select the best of the objects grouped by a foreign key?

I know my title is a bit convoluted, so allow me to demonstrate. I'm using Python 3 with Django 2.2.5. Here are the models I'm currently working with:

from django.db import models
from django.db.models import F
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField, SearchVector, SearchQuery, SearchRank

class Thread(models.Model):
    title = models.CharField(max_length=100)
    last_update = models.DateTimeField(auto_now=True)

class PostQuerySet(models.QuerySet):
    _search_vector = SearchVector('thread__type') + \
                     SearchVector('thread__title') + \
                     SearchVector('from_name') + \
                     SearchVector('from_email') + \
                     SearchVector('message')

    ###
    # There's code here that updates the `Post.search_vector` field for each `Post` object
    # using `PostQuerySet._search_vector`.
    ###

    def search(self, text):
        """
            Search posts using the indexed `search_vector` field. I can, for example, call
            `Post.objects.search('influenza h1n1')`.
        """
        search_query = SearchQuery(text)
        search_rank = SearchRank(F('search_vector'), search_query)
        return self.annotate(rank=search_rank).filter(search_vector=search_query).order_by('-rank')

class Post(models.Model):
    thread = models.ForeignKey(Thread, on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    from_name = models.CharField(max_length=100)
    from_email = models.EmailField()
    message = models.TextField()
    in_response_to = models.ManyToManyField('self', symmetrical=False, blank=True)
    search_vector = SearchVectorField(null=True)

    objects = PostQuerySet.as_manager()

    class Meta:
        ordering = ['timestamp']
        indexes = [
            GinIndex(fields=['search_vector'])
        ]

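(For brevity I've removed some parts of these models that I think are irrelevant, but I'll add them back if they turn out to matter.)

In plain English, I'm working with an app that represents data from an email listserv. Basically, there is a Thread that contains multiple Post objects; people reply-all to the initial post and a discussion follows. I've just implemented a search feature using Django's built-in support for full-text search. It's super fast and I love it. Here's an example of how I search in views.py:
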
###
# Pull `query` from a form defined in `forms.py`.
###

search_results = Post.objects.search(query).order_by('-timestamp')

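Everything works well, and the search results it returns make perfect sense. But I've just run into a situation I'm not sure how to handle. The displayed results aren't as clean as I'd like. This query gives me all Post objects that match the user-provided query. That's fine, but there may be many Post objects from the same Thread clogging up the results. It might look like this: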

post5 from thread2 - timestamp 2018-04-01, rank 0.5
post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post3 from thread1 - timestamp 2018-01-01, rank 0.6
post2 from thread1 - timestamp 2017-12-01, rank 0.7
post2 from thread2 - timestamp 2017-11-01, rank 0.7

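(Here, rank is the relevance score computed by Django's SearchRank.)

What I really want is this: for each Thread, show the most representative matching Post, sorted by timestamp in descending order. In other words, for each Thread that has a Post in the search results, only the highest-ranked Post should be shown, and those highest-ranked Post objects should be sorted by timestamp in descending order. So in the example above, these are the results I would like to see: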

post1 from thread3 - timestamp 2018-03-01, rank 0.25
post3 from thread2 - timestamp 2018-02-01, rank 0.75
post2 from thread1 - timestamp 2017-12-01, rank 0.7

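It would be fairly simple to do what I want with a few for loops, but I'd really like a way to do this purely in the ORM for efficiency. Do you have any suggestions? Let me know if you need me to clarify anything about the problem setup or what I'm after.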
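For reference, here is a minimal sketch of the loop-based approach mentioned above, run against the example rows from this question. The tuple layout and the name best_post_per_thread are illustrative assumptions, not code from the actual project.

```python
from datetime import date

# Illustrative rows: (post, thread, timestamp, rank), copied from the example above.
ROWS = [
    ("post5", "thread2", date(2018, 4, 1), 0.5),
    ("post1", "thread3", date(2018, 3, 1), 0.25),
    ("post3", "thread2", date(2018, 2, 1), 0.75),
    ("post3", "thread1", date(2018, 1, 1), 0.6),
    ("post2", "thread1", date(2017, 12, 1), 0.7),
    ("post2", "thread2", date(2017, 11, 1), 0.7),
]


def best_post_per_thread(rows):
    """Keep the highest-ranked row per thread, then order the winners newest first."""
    best = {}
    for row in rows:
        _, thread, _, rank = row
        # Replace the stored row only when this one ranks strictly higher.
        if thread not in best or rank > best[thread][3]:
            best[thread] = row
    # Order the per-thread winners by timestamp, descending.
    return sorted(best.values(), key=lambda r: r[2], reverse=True)


for post, thread, timestamp, rank in best_post_per_thread(ROWS):
    print(f"{post} from {thread} - timestamp {timestamp}, rank {rank}")
```

This works, but it pulls every matching Post into Python first, which is exactly the cost an ORM-level solution would avoid.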

I think we can use distinct to select the first row of each group of results.

Could you try something like this:

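# Assumed definitions (untested): on a Thread queryset the vector is reached
# through the `post` relation, and `query` is the user's search text.
search_query = SearchQuery(query)
search_rank = SearchRank(F('post__search_vector'), search_query)
results = Thread.objects.filter(post__search_vector=search_query) \
    .annotate(rank=search_rank) \
    .order_by('id', '-rank') \
    .distinct('id')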

# Then sort these **limited results** by rank manually in python instead of by thread id
# The performance of this should be much better than looping over all results in Python

I can't test this because I don't have a suitable Django model setup. Please share the output of print(results.query) for the query above.
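The manual sorting step mentioned in the comments could look roughly like the sketch below. The helper name sort_limited_results and the SimpleNamespace stand-ins are assumptions for illustration; a real call would pass the results queryset, whose objects carry the annotated rank attribute.

```python
from operator import attrgetter
from types import SimpleNamespace


def sort_limited_results(results, field="rank"):
    """Sort an already de-duplicated result set by `field`, highest first."""
    # sorted() iterates its argument once; for a lazy queryset that is
    # the point at which the database query would run.
    return sorted(results, key=attrgetter(field), reverse=True)


# Stand-in objects so the sketch runs without a database.
fake_results = [
    SimpleNamespace(id=1, rank=0.7),
    SimpleNamespace(id=2, rank=0.75),
    SimpleNamespace(id=3, rank=0.25),
]
print([r.id for r in sort_limited_results(fake_results)])
```

Because only one row per thread reaches Python, this sort touches far fewer objects than looping over every matching post.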

I think you have to query the Post model, ordering by thread, rank and timestamp, and then use distinct on thread.

Search

This is the search ordered by timestamp:

Post.objects.search("text").order_by("-timestamp")

This is the SQL executed on my local PostgreSQL:

SELECT
    "post"."from_name",
    "thread"."title",
    "post"."timestamp",
    ts_rank("post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM
    "post"
    INNER JOIN "thread" ON ("post"."thread_id" = "thread"."id")
WHERE
    "post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "post"."timestamp" DESC

These are the search results on my local data:

post1 from thread1 - timestamp 2019-07-01, rank 0.0607927
post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread2 - timestamp 2019-05-01, rank 0.0607927
post3 from thread1 - timestamp 2019-05-01, rank 0.0607927
post1 from thread3 - timestamp 2019-05-01, rank 0.0607927
post3 from thread2 - timestamp 2019-04-01, rank 0.0759909
post4 from thread1 - timestamp 2019-04-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
post5 from thread1 - timestamp 2019-03-01, rank 0.0607927
post3 from thread3 - timestamp 2019-03-01, rank 0.0607927
post4 from thread2 - timestamp 2019-03-01, rank 0.0607927
post5 from thread2 - timestamp 2019-02-01, rank 0.0759909
post4 from thread3 - timestamp 2019-02-01, rank 0.0759909
post5 from thread3 - timestamp 2019-01-01, rank 0.0759909

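Solution

This query shows only the most representative matching Post for each thread: the one with the highest search rank, with the most recent timestamp breaking ties. Note that the rows come back ordered by thread rather than by timestamp, because PostgreSQL requires the DISTINCT ON expression to lead the ORDER BY clause.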

Post.objects.search("text").order_by(
   "thread", "-rank", "-timestamp"
).distinct("thread")

This is the SQL executed on my local PostgreSQL:

SELECT DISTINCT ON ("forum_post"."thread_id")
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp",
    ts_rank("forum_post"."search_vector", plainto_tsquery('dolor')) AS "rank"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."search_vector" @@ (plainto_tsquery('dolor')) = TRUE
ORDER BY
    "forum_post"."thread_id" ASC,
    "rank" DESC,
    "forum_post"."timestamp" DESC

These are the search results on my local data:

post2 from thread1 - timestamp 2019-06-01, rank 0.0759909
post1 from thread2 - timestamp 2019-06-01, rank 0.0759909
post2 from thread3 - timestamp 2019-04-01, rank 0.0759909
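
To see why this ordering combined with distinct picks those three rows, here is a pure-Python sketch that emulates the DISTINCT ON semantics on the sample rows listed earlier. The tuple layout and the function name distinct_on_thread are illustrative assumptions, not part of Django or of the models above.

```python
from itertools import groupby

# (post, thread, timestamp, rank) rows from the local search results above.
ROWS = [
    ("post1", "thread1", "2019-07-01", 0.0607927),
    ("post2", "thread1", "2019-06-01", 0.0759909),
    ("post1", "thread2", "2019-06-01", 0.0759909),
    ("post2", "thread2", "2019-05-01", 0.0607927),
    ("post3", "thread1", "2019-05-01", 0.0607927),
    ("post1", "thread3", "2019-05-01", 0.0607927),
    ("post3", "thread2", "2019-04-01", 0.0759909),
    ("post4", "thread1", "2019-04-01", 0.0759909),
    ("post2", "thread3", "2019-04-01", 0.0759909),
    ("post5", "thread1", "2019-03-01", 0.0607927),
    ("post3", "thread3", "2019-03-01", 0.0607927),
    ("post4", "thread2", "2019-03-01", 0.0607927),
    ("post5", "thread2", "2019-02-01", 0.0759909),
    ("post4", "thread3", "2019-02-01", 0.0759909),
    ("post5", "thread3", "2019-01-01", 0.0759909),
]


def distinct_on_thread(rows):
    """Emulate ORDER BY thread, rank DESC, timestamp DESC with DISTINCT ON (thread)."""
    # Python's sort is stable, so sort by the lowest-priority key first.
    # ISO date strings compare chronologically as plain strings.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)     # timestamp DESC
    ordered = sorted(ordered, key=lambda r: r[3], reverse=True)  # rank DESC
    ordered = sorted(ordered, key=lambda r: r[1])                # thread ASC
    # DISTINCT ON keeps only the first row of each thread group.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[1])]


for post, thread, timestamp, rank in distinct_on_thread(ROWS):
    print(f"{post} from {thread} - timestamp {timestamp}, rank {rank}")
```

The first row of each thread group is the highest-ranked post, and among equally ranked posts the newest one, which matches the three rows returned by the database.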

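Notes

You can read more about distinct in the official Django documentation.

Update

If you absolutely need the results sorted by timestamp in descending order and don't need to display the rank, you can wrap the previous query in a subquery and then sort the selected posts: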

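from django.db.models import Subquery

# The inner query picks one post id per thread; the outer query re-sorts them.
Post.objects.filter(
    pk__in=Subquery(
        Post.objects.search("dolor")
        .order_by("-thread", "-rank", "-timestamp")
        .distinct("thread")
        .values("id")
    )
).order_by("-timestamp")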

This is the SQL executed on my local PostgreSQL:

SELECT
    "forum_post"."from_name",
    "forum_thread"."title",
    "forum_post"."timestamp"
FROM
    "forum_post"
    INNER JOIN "forum_thread" ON ("forum_post"."thread_id" = "forum_thread"."id")
WHERE
    "forum_post"."id" IN ( SELECT DISTINCT ON (U0. "thread_id")
            U0. "id"
        FROM
            "forum_post" U0
        WHERE
            U0. "search_vector" @@ (plainto_tsquery('dolor')) = TRUE
        ORDER BY
            U0. "thread_id" DESC,
            ts_rank(U0. "search_vector", plainto_tsquery('dolor'))
            DESC,
            U0. "timestamp" DESC)
ORDER BY
    "forum_post"."timestamp" DESC

These are the search results on my local data:

post2 from thread1 - timestamp 2019-06-01
post1 from thread2 - timestamp 2019-06-01
post2 from thread3 - timestamp 2019-04-01