Optimise Django query with large subquery

Question

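I have a database containing Profile and Relationship models. I haven't explicitly linked them in the models (because they are third-party IDs and may not yet exist in both tables), but the source and target fields each map to one or more Profile objects via the id field: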

from django.db import models
class Profile(models.Model):
    id = models.BigIntegerField(primary_key=True)
    handle = models.CharField(max_length=100)

class Relationship(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField(db_index=True)
    target = models.BigIntegerField(db_index=True)

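My query needs to get a list of 100 values from the Relationship.source column that don't yet exist as a Profile.id. That list will then be used to collect the necessary data from the third party. The query below works, but as the table grows (10m+), the subquery is getting very large and slow.

Any recommendations for how to optimise this? The backend is PostgreSQL, but I'd like to use the native Django ORM if possible.

EDIT: There's an extra level of complexity that will be contributing to the slow query. Not all IDs are guaranteed to return successfully, which means they would continue to "not exist" and put the program in an infinite loop. So I've added a filter and order_by to pass in the highest id from the previous batch of 100. This is likely to be causing some of the problems, so apologies for leaving it out initially.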

from django.db.models import Subquery

user = Profile.objects.get(handle="philsheard")
qs_existing_profiles = Profile.objects.all()  # every Profile row feeds the subquery below
rels = Relationship.objects.filter(
    target=user.id,
).exclude(
    # Compiles to NOT IN (SELECT id FROM profile), which is the slow part
    source__in=Subquery(qs_existing_profiles.values("id"))
).values_list(
    "source", flat=True
).order_by(
    "source"
).filter(
    source__gt=max_id_from_previous_batch  # An integer representing a previous `Relationship.source` id
)

Thanks in advance for any advice!
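
The batching described in the edit above can be sketched outside Django. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module rather than PostgreSQL, the table names `profile` and `relationship` are hypothetical, `fake_fetch` is a made-up stand-in for the third-party call, and the batch size is 3 instead of 100 to keep the data small. It shows why moving the cursor to the highest source of each batch prevents IDs that keep failing from being selected forever.

```python
import sqlite3

BATCH_SIZE = 3  # the question uses 100; kept small for the example

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE profile (id INTEGER PRIMARY KEY, handle TEXT);
    CREATE TABLE relationship (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source INTEGER,
        target INTEGER
    );
    CREATE INDEX rel_source ON relationship (source);
    CREATE INDEX rel_target ON relationship (target);
    """
)
conn.execute("INSERT INTO profile (id, handle) VALUES (1, 'philsheard')")
# Sources 10..17 all point at target 1 and have no profile yet.
conn.executemany(
    "INSERT INTO relationship (source, target) VALUES (?, 1)",
    [(source,) for source in range(10, 18)],
)


def next_batch(conn, target_id, after_source, size):
    """Return up to `size` source ids above `after_source` with no profile."""
    rows = conn.execute(
        """
        SELECT r.source
        FROM relationship AS r
        WHERE r.target = ?
          AND r.source > ?
          AND r.source NOT IN (SELECT id FROM profile)
        ORDER BY r.source
        LIMIT ?
        """,
        (target_id, after_source, size),
    ).fetchall()
    return [row[0] for row in rows]


def fake_fetch(source_id):
    """Hypothetical third-party lookup: pretend odd ids never come back."""
    return source_id % 2 == 0


max_id_from_previous_batch = 0
batches = []
while True:
    batch = next_batch(conn, 1, max_id_from_previous_batch, BATCH_SIZE)
    if not batch:
        break
    batches.append(batch)
    for source_id in batch:
        if fake_fetch(source_id):
            conn.execute(
                "INSERT INTO profile (id, handle) VALUES (?, ?)",
                (source_id, f"user{source_id}"),
            )
    # Advance past the whole batch, including ids whose fetch failed,
    # so the failed ids are not selected again on the next pass.
    max_id_from_previous_batch = batch[-1]
```

Without the `r.source > ?` condition, the odd ids in this example would be returned by every call and the loop would never end.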

Answer 1

For future searchers, here's how I bypassed the __in query and was able to speed up the results.

from django.db.models import Count, OuterRef, Subquery  # Count and OuterRef are new

user = Profile.objects.get(handle="philsheard")
# New queryset to use within Subquery: the Profile rows whose id matches the outer "source"
subq = Profile.objects.filter(id=OuterRef("source"))
rels = Relationship.objects.order_by(
    "source"
).annotate(
    # Annotate each relationship record with a Count of the times that the "source" ID
    # appears in the `Profile` table. We can then filter on those that have a count of 0
    # (ie don't appear and therefore haven't yet been connected)
    prof_count=Count(Subquery(subq.values("id")))
).filter(
    target=user.id,
    prof_count=0
).filter(
    source__gt=max_id_from_previous_batch  # An integer representing a previous `Relationship.source` id
).values_list(
    "source", flat=True
)

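I think this is faster because the query can finish as soon as it has reached the required 100 items (rather than comparing against a list of 1m+ IDs every time).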
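
A related option, not part of the answer above, is to express "has no matching profile" as a correlated NOT EXISTS. In the Django ORM that is usually written as a filter on `~Exists(Profile.objects.filter(id=OuterRef("source")))` followed by a slice such as `[:100]`. The sketch below only shows the SQL shape, using the standard-library sqlite3 module; the table names and the `missing_sources` helper are hypothetical, and whether PostgreSQL plans it any faster than the counting approach is something to confirm with EXPLAIN on real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE profile (id INTEGER PRIMARY KEY, handle TEXT);
    CREATE TABLE relationship (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source INTEGER,
        target INTEGER
    );
    CREATE INDEX rel_target_source ON relationship (target, source);
    """
)
# Profile 1 is the target user; profiles 2 and 4 were already collected.
conn.executemany(
    "INSERT INTO profile (id, handle) VALUES (?, ?)",
    [(1, "philsheard"), (2, "already_have_a"), (4, "already_have_b")],
)
# Sources 2..7 all point at target 1.
conn.executemany(
    "INSERT INTO relationship (source, target) VALUES (?, ?)",
    [(source, 1) for source in range(2, 8)],
)


def missing_sources(conn, target_id, after_source, limit=100):
    """Source ids for `target_id` above `after_source` that have no profile row."""
    rows = conn.execute(
        """
        SELECT r.source
        FROM relationship AS r
        WHERE r.target = ?
          AND r.source > ?
          AND NOT EXISTS (SELECT 1 FROM profile AS p WHERE p.id = r.source)
        ORDER BY r.source
        LIMIT ?
        """,
        (target_id, after_source, limit),
    ).fetchall()
    return [row[0] for row in rows]


first_page = missing_sources(conn, target_id=1, after_source=0)
next_page = missing_sources(conn, target_id=1, after_source=3, limit=2)
```

The LIMIT matters as much as the NOT EXISTS: with the rows read in `source` order, the database can stop once it has collected enough ids instead of checking every relationship.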

python

django

django-orm