Extracting unique results and reducing cost in a star schema in Postgres

I have the following star schema, made up of the tables shown below, which is used to determine the availability of books in a library.

f_book_availability: this fact table holds the availability of books and references the other dimension tables, which I explain below. Index: primary key.



   +-------------------------------------------------------------------------------+
   |   id   |  book_id |publisherid | location_id | genre_id | date_id| available  |
   |        |          |            |             |          |        |            |
   +-------------------------------------------------------------------------------+
   |   1    |  1       |    1       |      72     |   1      |   1    |    1       |
   |        |          |            |             |          |        |            |
   +-------------------------------------------------------------------------------+
   |   2    |  2       |    1       |      60     |  2       |   1    |     1      |
   |        |          |            |             |          |        |            |
   +-------------------------------------------------------------------------------+

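d_book: this dimension table holds details about the books, such as name and type. type only takes the values 1 and 2: 1 means "published by a publisher" and 2 means "self-published". There is no reference table for it. Index: primary key.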

+-----------+-----------+------------+
| id        | type      |  name      |
+------------------------------------+
| 1         |  1        |  LOR       |
+------------------------------------+
| 2         |  2        |  My life   |
+-----------+-----------+------------+

d_publisher: this dimension table holds publisher information. Index: primary key.

+-----------+------------
|  id       |   name    |
+-----------------------+
|  1        |   abc     |
+-----------------------+
|  2        |   def     |
+-----------+------------

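d_location: this dimension is the tricky one. (Note that the schema already exists and I cannot modify it.) It has a location id and a parent location id. The hierarchy is stored per id: for example, id 72 is a leaf node that identifies a rack in the library, and you can see that each leaf node has four entries with different parents, from which you can read the hierarchy country -> city -> library -> rack. This holds for every place. For example, for a library location you can find the hierarchy country -> city -> library. Indexes: 1) primary key (id & parentId), 2) parentId.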

+-----------+-----------+------------+------------+
| id        | parent_id |  name      | c_code     |
|           |           |            |            |
+-------------------------------------------------+
| 1         |  1        |  France    |  FR        |
+-------------------------------------------------+
| 4         |  1        |  Paris     |  FR        |
+-------------------------------------------------+
| 4         |  4        |  Paris     |  FR        |
+-------------------------------------------------+
| 25        |  1        |  GtLibrary |  FR        |
+-------------------------------------------------+
| 25        |  4        |  GtLibrary |  FR        |
+-------------------------------------------------+
| 25        |  25       |  GtLibrary |  FR        |
+-------------------------------------------------+
| 72        |  1        |  Rack1     |  FR        |
+-------------------------------------------------+
| 72        |  4        |  Rack1     |  FR        |
+-------------------------------------------------+
| 72        |  25       |  Rack1     |  FR        |
+-------------------------------------------------+
| 72        |  72       |  Rack1     |  FR        |
+-----------+-----------+------------+------------+
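To make the layout concrete, here is a minimal sketch of how the hierarchy for one location could be read back from this table. It assumes, as in the sample rows above, that every location has exactly one row per ancestor plus one "self" row where id = parent_id, so the number of rows for an id equals its level in the hierarchy.

```sql
-- Sketch: list the ancestors of location 72 from the top of the hierarchy down.
-- Assumption: each location has one row per ancestor plus a self row (id = parent_id).
select p.id, p.name
from d_location l
join d_location p
  on p.id = l.parent_id
 and p.id = p.parent_id            -- take only the ancestor's self row
where l.id = 72
order by (select count(*)          -- rows per id = level in the hierarchy
          from d_location a
          where a.id = p.id);
```

On the sample rows this lists France, Paris, GtLibrary and Rack1, in that order.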

d_genre: this dimension table holds genre information. Index: primary key.

+-----------+------------
|  id       |   name    |
+-----------------------+
|  1        |   fantasy |
+-----------------------+
|  2        |   horror  |
+-----------+------------

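d_date: this dimension table contains all dates. (Note that there are other columns I am not showing, such as month, day, year, day of week and week number. I left them out for simplicity and mention them only so you know it holds more than just the date, since otherwise it would look silly :) ) Indexes: 1) primary key, 2) date.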

+-----------+--------------
|  id       |   date      |
+-------------------------+
|  1        |   2020-11-25|
+-------------------------+
|  2        |   2019-10-24|
+-----------+--------------

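From these tables I am trying to determine whether a book is available on a particular date, together with information such as publisher, genre, date, location and its immediate parent location.

I wrote the following query: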

select 
    fba.id,
    location.c_code as country,
    parentLocation.name as parentPlace,
    location.id as locationId,
    location.name as locationName,
    publisher.id as "publisherId",
    publisher.name as publisherName,
    case when book.type = 1 then 'published' else 'self-published' end as "bookType",
    book.type as typeId,
    genre.name as genreName,
    book.id as "bookId",
    book.name as bookTitle,
    d."date",
    fba.available 
        from f_book_availability fba 
            join d_book book on fba.book_id = book.id
            join d_publisher publisher on fba.publisherid = publisher.id
            join d_location location on fba.location_id = location.id 
            join d_location parentLocation on location.parent_id = parentLocation.id
            join d_genre genre on fba.genre_id = genre.id 
            join d_date d on fba.date_id = d.id
            where 
                location.id <> location.parent_id 
                and d."date" >= now() and d."date" <= '2020-12-01'
                and location.c_code ='FR'
                and book.type = 1
                and genre.name = 'fantasy'

Actual output

+--------+---------+-------------+------------+--------------+-------------+---------------+-----------+--------+-----------+--------+-----------+------------+-----------+
| fba_id | country | parentPlace | locationId | locationName | publisherid | publisherName | bookType  | typeId | genreName | bookId | bookTitle | date       | available |
+--------+---------+-------------+------------+--------------+-------------+---------------+-----------+--------+-----------+--------+-----------+------------+-----------+
| 1      | FR      | France      | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | Paris       | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | Paris       | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | GtLibrary   | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | GtLibrary   | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | GtLibrary   | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
+--------+---------+-------------+------------+--------------+-------------+---------------+-----------+--------+-----------+--------+-----------+------------+-----------+

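Expected output: as you can see, there are duplicates because of the join to the parent location and the filter condition.

  1. The best outcome would be a single row: the last one, which pairs the rack with its library. It would be easier if location had an immediateParent flag, but I cannot change the schema.
  2. The second-best outcome would be distinct records, so in this case only 3 records: rack with library, rack with city, and rack with country. That is fine if I cannot achieve the first option. However, the DISTINCT clause is costly and I do not know how to reduce that cost.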
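For option 2, the repeated rows appear to come from the parent side of the join: parentLocation.id is not unique (in the sample data id 4 has two rows and id 25 has three), so each match is multiplied. A possible sketch that avoids DISTINCT, shown on the location tables only, is to join just the parent's "self" row. This assumes every location has exactly one row with id = parent_id, as the sample data suggests.

```sql
-- Sketch: one row per (location, ancestor) pair, without DISTINCT.
-- Assumption: each location has exactly one self row (id = parent_id).
select location.id         as locationId,
       location.name       as locationName,
       parentLocation.name as parentPlace
from d_location location
join d_location parentLocation
  on parentLocation.id = location.parent_id
 and parentLocation.id = parentLocation.parent_id   -- parent's self row only
where location.id <> location.parent_id
  and location.c_code = 'FR';
```

The extra join condition could be added to the parentLocation join of the full query in the same way.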

Cost

Unique  (cost=12070116.66..12072185.22 rows=63648 width=179)
  ->  Sort  (cost=12070116.66..12070275.78 rows=63648 width=179)
        Sort Key: fba.id, parentLocation.name, location.id, location.name, publisher.id, publisher.name, (CASE WHEN (book.type = 1) THEN 'published'::text ELSE 'self-published'::text END), genre.name, book.id, book.name, d.date, fba.available
        ->  Hash Join  (cost=3316.39..12059378.75 rows=63648 width=179)
              Hash Cond: (location.parent_id= parentLocation.id)
              ->  Hash Join  (cost=2601.08..12057653.07 rows=19090 width=141)
                    Hash Cond: (fa.publisher_id = publisher.id)
                    ->  Hash Join  (cost=2475.28..12057477.06 rows=19090 width=120)
                          Hash Cond: (fba.date_id = d.id)
                          ->  Gather  (cost=2466.05..12051967.29 rows=2092656 width=124)
                                Workers Planned: 2
                                ->  Hash Join  (cost=1466.05..11841701.69 rows=871940 width=124)
                                      Hash Cond: (fba.location_id = location.id)
                                      ->  Hash Join  (cost=820.92..11816871.09 rows=1374762 width=99)
                                            Hash Cond: (fa.book_id = book.id)
                                            ->  Hash Join  (cost=8.30..11808952.45 rows=2706393 width=54)
                                                  Hash Cond: (fba.genre_id = genre.id)
                                                  ->  Parallel Seq Scan on f_book_availability fba  (cost=0.00..11047094.57 rows=278758458 width=45)
                                                  ->  Hash  (cost=8.29..8.29 rows=1 width=25)
                                                        ->  Seq Scan on d_genre genre(cost=0.00..8.29 rows=1 width=25)
                                                              Filter: ((tech_en)::text = 'fantasy'::text)
                                            ->  Hash  (cost=702.26..702.26 rows=8829 width=53)
                                                  ->  Seq Scan on d_book book  (cost=0.00..702.26 rows=8829 width=53)
                                                        Filter: (type = 1)
                                      ->  Hash  (cost=613.88..613.88 rows=2500 width=33)
                                            ->  Seq Scan on d_location location  (cost=0.00..613.88 rows=2500 width=33)
                                                  Filter: ((id <> parent_id) AND ((c_code)::text = 'FR'::text))
                          ->  Hash  (cost=8.86..8.86 rows=29 width=8)
                                ->  Index Scan using date_unique on d_date d  (cost=0.28..8.86 rows=29 width=8)
                                      Index Cond: ((date >= now()) AND (date <= '2020-12-01'::date))
                    ->  Hash  (cost=98.69..98.69 rows=2169 width=29)
                          ->  Seq Scan on d_publisher publisher  (cost=0.00..98.69 rows=2169 width=29)
              ->  Hash  (cost=546.25..546.25 rows=13525 width=22)
                    ->  Seq Scan on d_location parentLocation  (cost=0.00..546.25 rows=13525 width=22)
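Reading the estimates in this plan, the Unique and Sort steps add little on top of the joins (about 12.07M against 12.06M). Most of the estimated cost is the Parallel Seq Scan on f_book_availability (about 11.05M for 278,758,458 rows), while the date filter is expected to keep only 29 rows of d_date. One thing that might be worth trying is an index on the fact table's date_id. The sketch below assumes no such index exists yet and that adding one is permitted. The index name is hypothetical, and the planner may still prefer the sequential scan.

```sql
-- Hypothetical index name; assumes f_book_availability has no index on date_id yet.
-- CONCURRENTLY avoids blocking writes but cannot run inside a transaction block.
create index concurrently if not exists f_book_availability_date_id_idx
    on f_book_availability (date_id);

-- Refresh planner statistics for the fact table.
analyze f_book_availability;

-- Re-check the plan for the date-restricted part of the query.
explain
select fba.id
from f_book_availability fba
join d_date d on fba.date_id = d.id
where d."date" >= now()
  and d."date" <= '2020-12-01';
```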

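Sorry if this post does not belong here, and for any typos: it took me two hours to create those ASCII tables, and I may have introduced typos while changing the column names for the example.

Added as an answer so it can be accepted, please :)

There is nothing fundamentally wrong with the data in d_location (although I would denormalise it and put all the parent fields into this table), so I am not suggesting you need to redesign that. It is the key design of the table that is wrong and needs correcting: you should not have a composite PK in a dimension table, because, as you have found, your queries do not work.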
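If the table really cannot be changed, a possible workaround for picking only the immediate parent is sketched below. It is not a substitute for fixing the key. It assumes that every location has one row per ancestor plus a self row (id = parent_id), so that the number of rows for an id gives its level in the hierarchy and the immediate parent is the ancestor exactly one level up.

```sql
-- Sketch: one row per location with its immediate parent only.
-- Assumption: rows per id = level in the hierarchy (one row per ancestor + self row).
with depth as (
    select id, count(*) as depth
    from d_location
    group by id
)
select child.id    as locationId,
       child.name  as locationName,
       parent.id   as parentId,
       parent.name as parentPlace
from d_location child
join depth cd
  on cd.id = child.id
join depth pd
  on pd.id = child.parent_id
 and pd.depth = cd.depth - 1             -- ancestor exactly one level up
join d_location parent
  on parent.id = child.parent_id
 and parent.id = parent.parent_id        -- parent's self row only
where child.c_code = 'FR';
```

On the sample rows this pairs Rack1 (72) with GtLibrary (25), GtLibrary with Paris (4), and Paris with France (1). The root has no row because nothing sits one level above it.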