Improve speed of query in a many-to-many relationship

Question

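To teach myself programming, I'm building a web application (Flask, SQLAlchemy, Jinja) that displays all the books I have ordered from Amazon.

In the "barest bones" way possible, I'm trying to learn how to replicate http://pinboard.in, which is my paragon. I have no idea how his site runs so fast: I can load 160 bookmark entries, all with their associated tags, in what, 500 ms? That is why I know I'm doing something wrong, as described below.

In any case, I created a many-to-many relationship between my Book class and my Tag class so that a user can (1) click on a book and see all its tags, and (2) click on a tag and see all the associated books. Here is my table schema: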

[Image: entity relationship diagram]

Here is the code for the relationship between the two classes:

assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
)

class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))

    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))

    def __init__(self, title, auth, comment, date_read, era, url, notable):
        self.title = title
        self.auth = auth
        self.comment = comment
        self.date_read = date_read
        self.era = era
        self.url = url
        self.notable = notable
    
class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))

The Problem

If I iterate over just the books table (~400 rows), the query runs and renders to the browser at lightning speed. No problem there.

{% for i in book_query %}
    <li>
      {{i.notable}}{{i.notable}}
      <a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
      <a href="/era/{{i.era}}">{{i.era}}</a> {{i.date_read}}
        {% if i.comment %}
          <p>{{i.comment}}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
{% endfor %}

But if I want to show all the tags associated with each book, I change the code by nesting a for loop, like this:

{% for i in book_query %}
    <li>
      {{i.notable}}{{i.notable}}
      <a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
      <a href="/era/{{i.era}}">{{i.era}}</a>
        {% for ii in i.tagged %}
            <a href="/tag/{{ii.tag_name}}">{{ii.tag_name}}</a>
        {% endfor %}
      {{i.date_read}}
        {% if i.comment %}
          <p>{{i.comment}}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
  {% endfor %}

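The query slows down significantly (it takes about 20 seconds). My understanding is that this happens because, for every row in the books table, my code iterates through the entire assoc table (i.e., a "full table scan").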

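Discussion (or "what I think is happening")

Obviously, I'm a complete noob; I've been programming for about 3 months. Just getting things to work is motivating, but I realize I have large gaps in my knowledge base, which I'm trying to fill as I go along.

Right off the bat, I can appreciate that this is very inefficient: for each new book, the code iterates through the entire association table (if that is indeed what's happening, which I believe it is). I think I need to cluster (?) or sort (?) the assoc table so that once I have retrieved all the tags for the book with book_id == 1, I never again "check" the rows with book_id == 1 in the assoc table.

In other words, what I think is happening is this (in computer-speak):

Oh, he wants to know how the book with book_id == 1 in the books table has been tagged
OK, let me go to the assoc table
Row 1 ... is book_id in the assoc table equal to 1?
OK, it is; then what is the tag_id for row 1? ... [then the computer goes to the tags table to get the tag_name and returns it to the browser]
Row 2 ... is book_id in the assoc table equal to 1?
Oh, no, it isn't ... OK, go to row 3
Hmmm, because my programmer is stupid and didn't sort or index this table in some way, I'm going to have to go through the entire assoc table looking for book_id == 1 when there might not be any more ...

Then, once we get to book_id == 2 in the books table, the computer gets really mad:

OK, he wants to know all the tags that go with book_id == 2
OK, let me go to the assoc table
Row 1 ... wait a second ... didn't I check this one already? Holy sh#t, I have to do this all over again??
Dammit ... OK ... row 1 ... is book_id == 2? (I know it isn't! But I have to check anyway because my programmer is a dum-dum ...)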

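Question

So the question is: can I (1) sort (?) or cluster (?) the assoc table to ensure a more "intelligent" traversal through the assoc table, or, as a friend of mine suggested, do I (2) "learn to write good SQL queries"? (Note that I've never learned SQL, since I've been handling everything with SQLAlchemy ... damn alchemists ... shrouding their magic in secret and whatnot.)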
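
For reference, the "sort or cluster" idea in option (1) corresponds most closely to a database index on assoc.book_id. The sketch below uses Python's built-in sqlite3 module on a toy copy of the assoc table (an assumption; the real database engine is not stated here), and the index name ix_assoc_book_id is made up for illustration. It only shows how an index changes a single lookup; it says nothing about how many queries the application sends.

```python
import sqlite3

# Toy copy of the association table; data and index name are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER)")
conn.executemany(
    "INSERT INTO assoc VALUES (?, ?)",
    [(book_id, tag_id) for book_id in range(1, 401) for tag_id in (1, 2, 3)],
)

def plan(sql):
    """Return SQLite's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

lookup = "SELECT tag_id FROM assoc WHERE book_id = 2"

plan_without_index = plan(lookup)  # a scan over every row of assoc
conn.execute("CREATE INDEX ix_assoc_book_id ON assoc (book_id)")
plan_with_index = plan(lookup)     # a search that goes through the index

print(plan_without_index)
print(plan_with_index)
```

With the index in place the engine no longer walks every assoc row to find the tags of one book, which is the behaviour the imaginary monologue above complains about.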

Final words

Thanks for any input. If you have suggestions to help me improve how I ask questions on Stack Overflow (this is my first post!), please let me know.

Answer 1

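Most of the answer is in the question.

In the first example, 1 SQL query is executed when you iterate through the books table. In the second example, a separate assoc query is executed for every Book. So that is about 400 SQL queries, which is quite time consuming. You can view them in your app's debug log if you set the SQLALCHEMY_ECHO config parameter: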

app.config['SQLALCHEMY_ECHO'] = True

Or you can install Flask-DebugToolbar and look at these queries in the web interface.
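
The difference in query count can also be reproduced outside Flask. The following is a minimal sketch using Python's built-in sqlite3 module rather than SQLAlchemy (an assumption made to keep it dependency-free); the table contents and function names are invented for illustration. It counts the SELECT statements issued by the one-query-per-book pattern and by a single joined query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (tag_id  INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
""")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [(i, f"Book {i}") for i in range(1, 401)])
conn.executemany("INSERT INTO tags VALUES (?, ?)", [(1, "history"), (2, "fiction")])
conn.executemany("INSERT INTO assoc VALUES (?, ?)",
                 [(i, 1 + i % 2) for i in range(1, 401)])

# Record every SELECT the connection runs from here on.
selects = []
def record(sql):
    if sql.lstrip().upper().startswith("SELECT"):
        selects.append(sql)
conn.set_trace_callback(record)

def tags_one_by_one():
    """One query for the books, then one more per book (the nested-loop pattern)."""
    result = {}
    for book_id, _title in conn.execute("SELECT book_id, title FROM books").fetchall():
        rows = conn.execute(
            "SELECT tags.tag_name FROM tags "
            "JOIN assoc ON assoc.tag_id = tags.tag_id WHERE assoc.book_id = ?",
            (book_id,),
        ).fetchall()
        result[book_id] = [name for (name,) in rows]
    return result

def tags_joined():
    """A single joined query whose rows are grouped in Python."""
    result = {}
    rows = conn.execute(
        "SELECT books.book_id, tags.tag_name FROM books "
        "JOIN assoc ON assoc.book_id = books.book_id "
        "JOIN tags ON assoc.tag_id = tags.tag_id"
    ).fetchall()
    for book_id, name in rows:
        result.setdefault(book_id, []).append(name)
    return result

one_by_one = tags_one_by_one()
one_by_one_count = len(selects)

selects.clear()
joined = tags_joined()
joined_count = len(selects)

print(one_by_one_count, joined_count)
```

Both functions return the same mapping of book_id to tag names; the first issues one SELECT for the book list plus one per book, the second issues a single SELECT.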

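The best approach to this problem is to learn SQL basics; you will need them anyway once your application grows. Try to write a more optimized query in pure SQL. For your case it may look like this: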

SELECT books.*, tags.tag_name FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id

Then try to rewrite it in SQLAlchemy code, grouping the rows by book before passing them to the HTML renderer:

# Single query to get all books and their tags
# (inner join: books without any tags are not returned)
query = db.session.query(Book, Tag.tag_name).join(Book.tagged)
# Dictionary of data to be passed to renderer, keyed by book_id
books = {}
for book, tag_name in query:
    book_data = books.setdefault(book.book_id, {'book': book, 'tags': []})
    book_data['tags'].append(tag_name)
# Rendering HTML; pass the grouped entries, because iterating a dict in Jinja yields only its keys
return render_template('yourtemplate.html', books=list(books.values()))

The template code would look like this:

{% for book in books %}
<li>
  {{ book.book.notable }}{{ book.book.notable }}
  <a href="{{ book.book.url }}">{{ book.book.title }}</a>, {{ book.book.auth }}
  <a href="/era/{{ book.book.era }}">{{ book.book.era }}</a>
  {% for tag in book.tags %}
    &nbsp;<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>&nbsp;
  {% endfor %}
  {{ book.book.date_read }}
    {% if book.book.comment %}
      <p>{{ book.book.comment }}</p>
    {% else %}
      <!-- print nothing -->
    {% endif %}
</li>
{% endfor %}
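
The grouping step in the view code above can be tried without a database. This is a small sketch on hand-written rows; the titles and tag names are invented sample data, and plain tuples stand in for the Book objects a real query would yield.

```python
# Flat rows as a joined query would yield them:
# one (book_id, title, tag_name) tuple per book/tag pair.
rows = [
    (1, "Dune", "sci-fi"),
    (1, "Dune", "classic"),
    (2, "Emma", "classic"),
]

books = {}
for book_id, title, tag_name in rows:
    # setdefault creates the entry the first time a book_id is seen
    # and returns the existing entry on every later row for that book.
    entry = books.setdefault(book_id, {"title": title, "tags": []})
    entry["tags"].append(tag_name)

for entry in books.values():
    print(entry["title"], entry["tags"])
```

Each book ends up with exactly one entry, however many rows the join produced for it, which is what lets the template loop over books and then over each book's tags.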

Another approach

If your database is PostgreSQL, you can write a query like this:

SELECT books.title, books.auth (...), array_agg(tags.tag_name) as book_tags FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
GROUP BY books.title, books.auth (...)

In this case you get the book data with its tags already aggregated into an array. SQLAlchemy lets you build such a query:

from sqlalchemy import func

books = db.session.query(Book, func.array_agg(Tag.tag_name)).\
    join(Book.tagged).group_by(Book).all()
return render_template('yourtemplate.html', books=books)

And the template has the following structure:

{% for book, tags in books %}
<li>
  {{ book.notable }}{{ book.notable }}
  <a href="{{ book.url }}">{{ book.title }}</a>, {{ book.auth }}
  <a href="/era/{{ book.era }}">{{ book.era }}</a>
  {% for tag in tags %}
    &nbsp;<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>&nbsp;
  {% endfor %}
  {{ book.date_read }}
    {% if book.comment %}
      <p>{{ book.comment }}</p>
    {% else %}
      <!-- print nothing -->
    {% endif %}
</li>
{% endfor %}
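
array_agg is specific to PostgreSQL. As a rough illustration of the same idea on another engine, the sketch below uses SQLite's group_concat through Python's built-in sqlite3 module. This is a swapped-in technique, not the query from this answer: group_concat returns a delimited string rather than an array, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (tag_id  INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma');
    INSERT INTO tags  VALUES (1, 'sci-fi'), (2, 'classic');
    INSERT INTO assoc VALUES (1, 1), (1, 2), (2, 2);
""")

rows = conn.execute("""
    SELECT books.title, group_concat(tags.tag_name, ',') AS book_tags
    FROM books
    JOIN assoc ON assoc.book_id = books.book_id
    JOIN tags  ON assoc.tag_id = tags.tag_id
    GROUP BY books.book_id
    ORDER BY books.book_id
""").fetchall()

# One row per book; split the delimited string back into a list.
# The order inside a group is not guaranteed, so sort for a stable result.
books = [(title, sorted(book_tags.split(","))) for title, book_tags in rows]
print(books)
```

Splitting on a comma breaks if a tag name itself contains a comma, so a real application would need a separator that cannot occur in the data.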

Answer 2

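If your query returns a lot of books, fetching the tags for each book one by one in separate SQL statements will kill your response time through network I/O.

If you know you will always need the tags for this query, one way to optimize it is to hint SQLAlchemy to fetch all the related tags in one query, either with a join or with a subquery.

I haven't seen your query, but I'm guessing a subquery load would work best for your use case: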

from sqlalchemy.orm import subqueryload
session.query(Book).options(subqueryload(Book.tagged)).filter(...).all()
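
To see the effect of the loader option, here is a self-contained sketch that counts SELECT statements with and without it. It assumes SQLAlchemy 1.4 or newer with plain declarative models on an in-memory SQLite database instead of the Flask-SQLAlchemy db object, and the helper names (record_statement, count_selects) and sample data are made up for illustration.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine, event
from sqlalchemy.orm import Session, declarative_base, relationship, subqueryload

Base = declarative_base()

assoc = Table(
    "assoc", Base.metadata,
    Column("book_id", Integer, ForeignKey("books.book_id")),
    Column("tag_id", Integer, ForeignKey("tags.tag_id")),
)

class Book(Base):
    __tablename__ = "books"
    book_id = Column(Integer, primary_key=True)
    title = Column(String(120))
    tagged = relationship("Tag", secondary=assoc)  # lazy loading by default

class Tag(Base):
    __tablename__ = "tags"
    tag_id = Column(Integer, primary_key=True)
    tag_name = Column(String(120))

engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    history, fiction = Tag(tag_name="history"), Tag(tag_name="fiction")
    session.add_all([Book(title=f"Book {i}", tagged=[history, fiction]) for i in range(20)])
    session.commit()

statements = []

@event.listens_for(engine, "before_cursor_execute")
def record_statement(conn, cursor, statement, parameters, context, executemany):
    statements.append(statement)

def count_selects(load_books):
    """Load books, touch every book's tags, and count the SELECTs that were sent."""
    statements.clear()
    with Session(engine) as session:
        for book in load_books(session):
            _ = [tag.tag_name for tag in book.tagged]
    return sum(s.lstrip().upper().startswith("SELECT") for s in statements)

lazy_count = count_selects(lambda s: s.query(Book).all())
eager_count = count_selects(
    lambda s: s.query(Book).options(subqueryload(Book.tagged)).all()
)
print(lazy_count, eager_count)
```

With the default lazy loading, touching book.tagged sends one extra SELECT per book; with subqueryload the tags for all books arrive in one additional SELECT, so the total no longer grows with the number of books.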

Answer 3

The following, adapted from @Sergey-Shubin's implementation, is a workable solution to this problem:

Classes & table association declaration

assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
    )

class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))    

    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))

class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))

def construct_dict(query):
    # query yields (Book, Tag) pairs, one per assoc row, so a book with
    # several tags appears several times; group the pairs by book_id.
    books_dict = {}
    for book, tag in query:
        # result shape: {book_id: {'bookkey': <Book>, 'tagkey': [<Tag>, <Tag>, ...]}}
        book_data = books_dict.setdefault(book.book_id, {'bookkey': book, 'tagkey': []})
        if tag is not None:  # the outer join yields None for a book with no tags
            book_data['tagkey'].append(tag)
    return books_dict

Route, SQLAlchemy query

@app.route('/query')
def query():
    query = db.session.query(Book, Tag).outerjoin(Book.tagged) # query to get all books and their tags
    books_dict = construct_dict(query)

    return render_template("query.html", query=query, books_dict=books_dict)
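
The route above uses outerjoin where the earlier answer used join. A short sketch of what that changes, written against SQLite through Python's built-in sqlite3 module (an assumption; sample rows and the helper name book_tag_rows are invented): an inner join drops books that have no tags, while a left outer join keeps them with an empty tag column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (tag_id  INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma');  -- 'Emma' gets no tags
    INSERT INTO tags  VALUES (1, 'sci-fi');
    INSERT INTO assoc VALUES (1, 1);
""")

def book_tag_rows(join_keyword):
    """Run the books/assoc/tags query with the given join type."""
    sql = (
        "SELECT books.title, tags.tag_name FROM books "
        f"{join_keyword} assoc ON assoc.book_id = books.book_id "
        f"{join_keyword} tags ON assoc.tag_id = tags.tag_id "
        "ORDER BY books.book_id"
    )
    return conn.execute(sql).fetchall()

inner_rows = book_tag_rows("JOIN")             # untagged book is missing
outer_rows = book_tag_rows("LEFT OUTER JOIN")  # untagged book appears with None

print(inner_rows)
print(outer_rows)
```

When the outer join is used, code that groups the rows has to expect None in the tag position for untagged books.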
