django select_related - when to use it

I'm trying to optimize my ORM queries in Django. I use connection.queries to view the queries that Django generates for me.

Suppose I have these models:

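from django.db import models

class Author(models.Model):
    name   = models.CharField(max_length=50)

# Author is defined first so that Book can reference it by class.
class Book(models.Model):
    name   = models.CharField(max_length=50)
    # on_delete is required since Django 2.0; CASCADE was the old implicit default.
    author = models.ForeignKey(Author, on_delete=models.CASCADE)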

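Say that when I generate a particular web page, I want to display all books, each with its author's name next to it. I also display all the authors separately.

So should I use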

Book.objects.all().select_related("author")

which will result in a JOIN query? Even if I have already run this line before it:

Author.objects.all()

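Obviously, in the template I will write something like {{book.author.name}}.

So the question is: when I access a foreign key value (the author), if Django already has that object from another query, will that still result in an additional query (for each book)? If not, does using select_related in that case actually create a performance overhead?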

Book.objects.select_related("author")

is enough. There is no need for Author.objects.all().

{{ book.author.name }}

won't hit the database, because book.author is already pre-populated.

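Django doesn't know about other queries! Author.objects.all() and Book.objects.all() are totally different querysets. So if you have both of them in your view and pass them to the template context, but in your template you do something like: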

{% for book in books %}
  {{ book.author.name }}
{% endfor %}

and you have N books, this will result in N extra database queries (on top of the queries that fetch all books and all authors)!

If you had done Book.objects.all().select_related("author") instead, no extra queries would be executed by the template snippet above.
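To make the N extra queries visible without a Django project, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for the ORM. The table names, the sample rows and the run() helper are assumptions made up for this illustration; the point is only the number of statements each approach sends to the database.

```python
import sqlite3

# Hypothetical in-memory schema standing in for the Book/Author models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    INSERT INTO author (id, name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book (name, author_id) VALUES ('B1', 1), ('B2', 2), ('B3', 1);
""")

query_log = []  # every statement sent through run(), similar in spirit to connection.queries


def run(sql, params=()):
    """Execute one statement, record it, and return all rows."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


def names_without_join():
    # One query for the books, then one query per book for its author (N + 1).
    rows = run("SELECT id, name, author_id FROM book ORDER BY id")
    return [
        (name, run("SELECT name FROM author WHERE id = ?", (author_id,))[0][0])
        for _id, name, author_id in rows
    ]


def names_with_join():
    # A single query whose rows already carry the author columns.
    return run(
        "SELECT book.name, author.name FROM book "
        "INNER JOIN author ON book.author_id = author.id ORDER BY book.id"
    )


query_log.clear()
slow = names_without_join()
queries_without_join = len(query_log)  # 1 + number of books

query_log.clear()
fast = names_with_join()
queries_with_join = len(query_log)  # always 1

print(queries_without_join, queries_with_join)
```

Both functions return the same (book, author) pairs; only the number of round trips differs, and that gap grows linearly with the number of books.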

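Now, select_related() of course adds some overhead to the query. What happens is that when you do Book.objects.all(), Django returns the result of SELECT * FROM BOOKS. If instead you do Book.objects.all().select_related("author"), Django returns the result of SELECT * FROM BOOKS B INNER JOIN AUTHORS A ON B.AUTHOR_ID = A.ID (a LEFT OUTER JOIN if the foreign key is nullable). So for each book it returns the book's columns together with those of its author. However, this overhead is much smaller than the overhead of hitting the database N times (as explained above).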

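So even though select_related creates a small performance overhead (each query returns more fields from the database), it is actually beneficial to use it, unless you are completely sure that you will need only the columns of the specific model you are querying.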

Finally, a great way to see how many (and exactly which) queries are actually executed against your database is to use django-debug-toolbar (https://github.com/django-debug-toolbar/django-debug-toolbar).

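You are actually asking two different questions:

1. Does using select_related actually create a performance overhead?

You should look at the documentation about Django's QuerySet cache: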

Understand QuerySet evaluation

To avoid performance problems, it is important to understand:

  • that QuerySets are lazy.

  • when they are evaluated.

  • how the data is held in memory.

In summary, Django caches in memory the results evaluated within the same QuerySet object; that is, if you do something like this:

books = Book.objects.all().select_related("author")
for book in books:  # Iterating evaluates the queryset and caches the results in memory
    print(book.author.name)  # No extra query: the author came with the JOIN
first_book = books[0]  # Does not hit the db (served from the cache)
print(first_book.author.name)  # Does not hit the db

Because select_related fetches the authors together with the books, the database is hit only once: all of this results in a single database query using an INNER JOIN.

But this does not do any caching between querysets, not even for the same query:

books = Book.objects.all().select_related("author")
books2 = Book.objects.all().select_related("author")
first_book = books[0]  # Hits the db
first_book = books2[0]  # Hits the db again

This is actually pointed out in the docs:

We will assume you have done the obvious things above. The rest of this document focuses on how to use Django in such a way that you are not doing unnecessary work. This document also does not address other optimization techniques that apply to all expensive operations, such as general purpose caching.
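The per-object caching described in this answer can be pictured with a toy class. This is only a sketch of the idea, not Django's implementation: LazyResults, fetch_books and fetch_log are hypothetical names, and the toy fetches every row on an uncached index lookup where a real QuerySet would issue a LIMIT query.

```python
class LazyResults:
    """Toy stand-in for a QuerySet: lazy, with a result cache per object."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable that simulates a database hit
        self._result_cache = None  # filled the first time the object is iterated

    def __iter__(self):
        if self._result_cache is None:
            self._result_cache = list(self._fetch())
        return iter(self._result_cache)

    def __getitem__(self, index):
        if self._result_cache is not None:
            return self._result_cache[index]  # served from memory
        # Not evaluated yet: fetch for this lookup only, without filling the cache.
        return list(self._fetch())[index]


fetch_log = []  # one entry per simulated database hit


def fetch_books():
    fetch_log.append("SELECT ... FROM book")
    return ["Book A", "Book B"]


books = LazyResults(fetch_books)
for _ in books:   # first iteration evaluates and caches
    pass
books[0]          # cached
list(books)       # cached
hits_after_reuse = len(fetch_log)

books2 = LazyResults(fetch_books)  # a different object shares nothing with `books`
books2[0]         # hits the "database"
books2[0]         # hits it again, because nothing was cached
hits_after_second_object = len(fetch_log)

print(hits_after_reuse, hits_after_second_object)
```

The cache lives on the object, so building the same query twice produces two independent caches, which is why the second snippet in this answer hits the database for each lookup.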

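2. If Django already has that object from another query, will that still result in an additional query (for each book)?

What you are really asking is whether Django does ORM query caching, which is a very different matter. ORM query caching means that if you run a query and later run the same query again, and the database has not changed in between, the result comes from a cache rather than from an expensive database lookup.

The answer is no: Django does not support this officially, but it is supported unofficially through third-party apps. The most relevant third-party apps that enable this kind of caching are: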

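  1. Johnny-Cache (older, does not support Django > 1.6)
  2. Django-Cachalot (newer, supports 1.6 and 1.7, with 1.8 support still in development)
  3. Django-Cacheops (newer, supports Python 2.7 or 3.3+, Django 1.8+ and Redis 2.6+ (4.0+ recommended))

Take a look at those if you are looking for query caching, and remember: profile first, find the bottlenecks, and optimize only if they are causing a problem.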

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming. Donald Knuth.
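For readers wondering what "ORM query caching" in this answer means in practice, here is a deliberately crude sketch using the standard-library sqlite3 module. It is not how Johnny-Cache, Django-Cachalot or Django-Cacheops are implemented; cached_select, write and the clear-everything invalidation are assumptions chosen to keep the idea small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO author (name) VALUES ('Ann')")

_cache = {}    # (sql, params) -> rows returned last time
db_reads = []  # SELECT statements that really reached the database


def cached_select(sql, params=()):
    """Return cached rows for an identical query, else read from the database."""
    key = (sql, params)
    if key not in _cache:
        db_reads.append(sql)
        _cache[key] = conn.execute(sql, params).fetchall()
    return _cache[key]


def write(sql, params=()):
    """Run a write and drop every cached result (crude invalidation)."""
    conn.execute(sql, params)
    _cache.clear()


query = "SELECT name FROM author ORDER BY id"
first = cached_select(query)   # reads from the database
second = cached_select(query)  # same query, unchanged data: served from the cache
reads_before_write = len(db_reads)

write("INSERT INTO author (name) VALUES (?)", ("Bob",))
third = cached_select(query)   # the write invalidated the cache, so this reads again
reads_after_write = len(db_reads)

print(reads_before_write, reads_after_write)
```

The hard part of any real query cache is invalidation; dropping everything on each write is safe but wasteful, which is one reason to profile before reaching for such a tool.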

Select_related

select_related is an optional performance booster: with it, further access to the attributes of foreign keys in a QuerySet won't hit the database.

Design philosophies

This is also why the select_related() QuerySet method exists. It’s an optional performance booster for the common case of selecting “every related object.”

Django official doc

Returns a QuerySet that will “follow” foreign-key relationships, selecting additional related-object data when it executes its query. This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries.

As the definition points out, select_related is only allowed on foreign-key (and one-to-one) relationships. Ignoring this rule raises the following exception:

In [21]: print(Book.objects.select_related('name').all().query)

FieldError: Non-relational field given in select_related: 'name'. Choices are: author

Let's dig into it with an example:

Here is my models.py. (Essentially the same as in the question)

from django.db import models


class Author(models.Model):
    name = models.CharField(max_length=50)

    def __str__(self):
        return self.name

    __repr__ = __str__


class Book(models.Model):
    name = models.CharField(max_length=50)
    author = models.ForeignKey(Author, related_name='books', on_delete=models.DO_NOTHING)

    def __str__(self):
        return self.name

    __repr__ = __str__

  • Fetching all books and their authors using the select_related booster:
In [25]: print(Book.objects.select_related('author').all().explain(verbose=True, analyze=True))
Hash Join  (cost=328.50..548.39 rows=11000 width=54) (actual time=3.124..8.013 rows=11000 loops=1)
  Output: library_book.id, library_book.name, library_book.author_id, library_author.id, library_author.name
  Inner Unique: true
  Hash Cond: (library_book.author_id = library_author.id)
  ->  Seq Scan on public.library_book  (cost=0.00..191.00 rows=11000 width=29) (actual time=0.008..1.190 rows=11000 loops=1)
        Output: library_book.id, library_book.name, library_book.author_id
  ->  Hash  (cost=191.00..191.00 rows=11000 width=25) (actual time=3.086..3.086 rows=11000 loops=1)
        Output: library_author.id, library_author.name
        Buckets: 16384  Batches: 1  Memory Usage: 741kB
        ->  Seq Scan on public.library_author  (cost=0.00..191.00 rows=11000 width=25) (actual time=0.007..1.239 rows=11000 loops=1)
              Output: library_author.id, library_author.name
Planning Time: 0.234 ms
Execution Time: 8.562 ms

In [26]: print(Book.objects.select_related('author').all().query)
SELECT "library_book"."id", "library_book"."name", "library_book"."author_id", "library_author"."id", "library_author"."name" FROM "library_book" INNER JOIN "library_author" ON ("library_book"."author_id" = "library_author"."id")

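As you can see, using select_related causes an INNER JOIN on the provided foreign keys (here, author).

The execution time, which covers:

  • running the query using the fastest plan chosen by the planner
  • returning the results

is 8.562 ms.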

On the other hand:

  • Fetching all books and their authors without using the select_related booster:
In [31]: print(Book.objects.all().explain(verbose=True, analyze=True))
Seq Scan on public.library_book  (cost=0.00..191.00 rows=11000 width=29) (actual time=0.017..1.349 rows=11000 loops=1)
  Output: id, name, author_id
Planning Time: 1.135 ms
Execution Time: 2.536 ms

In [32]: print(Book.objects.all().query)
SELECT "library_book"."id", "library_book"."name", "library_book"."author_id" FROM "library_book"

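As you can see, this is just a simple SELECT query on the book model that contains only author_id. The execution time in this case is 2.536 ms.

As mentioned in the Django doc:

Further access to foreign-key attributes will cause another hit on the database (because we don't have them yet):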

In [33]: books = Book.objects.all()

In [34]: for book in books:
    ...:     print(book.author) # Hit the database
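Why book.author costs a query in one case and nothing in the other can be sketched with plain Python objects. This is an illustration under assumptions, not Django's descriptor code: the Book and Author classes here are toy classes, and AUTHORS, load_author and lookups are hypothetical stand-ins for the author table and for database hits.

```python
class Author:
    def __init__(self, name):
        self.name = name


AUTHORS = {1: Author("Ann"), 2: Author("Bob")}  # pretend author table
lookups = []                                    # simulated database hits


def load_author(author_id):
    lookups.append(author_id)
    return AUTHORS[author_id]


class Book:
    def __init__(self, name, author_id, author=None):
        self.name = name
        self.author_id = author_id
        self._author = author  # pre-filled when the row came from a JOIN

    @property
    def author(self):
        if self._author is None:
            # Lazy load on first access, then keep the object on the instance.
            self._author = load_author(self.author_id)
        return self._author


# Like Book.objects.all(): only author_id is known up front.
plain = [Book("B1", 1), Book("B2", 2)]
names_plain = [b.author.name for b in plain]
lookups_plain = len(lookups)        # one lookup per book
names_again = [b.author.name for b in plain]
lookups_plain_again = len(lookups)  # unchanged: cached on each instance

# Like select_related("author"): the author arrives with the book.
lookups.clear()
joined = [Book("B1", 1, AUTHORS[1]), Book("B2", 2, AUTHORS[2])]
names_joined = [b.author.name for b in joined]
lookups_joined = len(lookups)       # no lookups at all

print(lookups_plain, lookups_plain_again, lookups_joined)
```

The related object is remembered per book instance, so the cost without select_related is one lookup per distinct book object, not one per attribute access.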

See also Database access optimization and explain() in the QuerySet API reference.

Django Database Caching:

Django comes with a robust cache system that lets you save dynamic pages so they don’t have to be calculated for each request. For convenience, Django offers different levels of cache granularity: You can cache the output of specific views, you can cache only the pieces that are difficult to produce, or you can cache your entire site.

Django also works well with “downstream” caches, such as Squid and browser-based caches. These are the types of caches that you don’t directly control but to which you can provide hints (via HTTP headers) about which parts of your site should be cached, and how.

You should read those docs to find out which approach suits you best.


PS1: For more information about the planner and how it works, see Using EXPLAIN.