How do I not normalize continuous data (INTS, FLOATS, DATETIME, ....)?

As I understand it (correct me if I'm wrong), "normalization" is the process of removing redundant data from a database design.

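However, while trying to learn about database performance optimization and tuning, I came across Mr. Rick James recommending against normalizing continuous values such as INTs, FLOATs, DATETIMEs, and so on: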

"Normalize, but don't over-normalize." In particular, do not normalize datetimes or floats or other "continuous" values.

Sure purists say normalize time. That is a big mistake. Generally, "continuous" values should not be normalized because you generally want to do range queries on them. If it is normalized, performance will be orders of magnitude worse.

Normalization has several purposes; they don't really apply here:

  • Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings

  • To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevant here; each time was independently assigned, and you are not a Time Lord.

  • In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still sucks.

source

Do not normalize "continuous" values -- dates, floats, etc -- especially if you will do range queries.

source.

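I have tried to understand this, but I can't. Could someone explain it to me and give a worst-case example where applying this rule improves performance?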

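Note: I could have asked him in a comment or elsewhere, but I wanted this documented and highlighted on its own, because I believe it is a very important point that affects the performance of almost my entire database.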

The comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?

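Let me elaborate on my 'claim' with this example. Some DBAs replace a DATE with a surrogate ID; this can cause significant performance problems when a date range is used. Contrast these: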

-- single table
SELECT ...
    FROM t
    WHERE x = ...
      AND date BETWEEN ... AND ...;   -- `date` is of datatype DATE/DATETIME/etc

-- extra table
SELECT ...
    FROM t
    JOIN Dates AS d  ON t.date_id = d.date_id
    WHERE t.x = ...
      AND d.date BETWEEN ... AND ...;  -- Range test is now in the other table

Moving the range test to a JOINed table is what causes the slowdown.
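
For concreteness, the two layouts being compared might look like the sketch below. It uses MySQL syntax. Both queries above call the main table `t`; the names `t_single` and `t_split` are invented here only so the two designs can sit side by side, and the `id` columns are likewise assumptions, not part of the original example.

```sql
-- Design 1 (single table): the date is stored directly in the main table.
CREATE TABLE t_single (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    x      INT NOT NULL,
    `date` DATE NOT NULL,               -- 3 bytes per row
    PRIMARY KEY (id)
);

-- Design 2 (extra table): the date is replaced by a surrogate id.
CREATE TABLE Dates (
    date_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `date`  DATE NOT NULL,
    PRIMARY KEY (date_id),
    UNIQUE KEY (`date`)
);

CREATE TABLE t_split (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    x       INT NOT NULL,
    date_id MEDIUMINT UNSIGNED NOT NULL, -- 3 bytes per row, points at Dates
    PRIMARY KEY (id)
);
```

Note that `date_id` values are handed out in insertion order, so nothing guarantees that a range of dates corresponds to a range of `date_id` values. The range test therefore has to go through `Dates`.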

The first query can be optimized with

INDEX(x, date)
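
As a sketch of how that index could be declared and checked in MySQL, assuming the single-table layout with the table `t` and the columns `x` and `date` from the first query (the index name and the literal values are placeholders chosen for illustration):

```sql
-- Composite index: equality column first, range column last.
ALTER TABLE t ADD INDEX idx_x_date (x, `date`);

-- Ask the optimizer how it would run the single-table query.
-- Assumes `date` is a DATE column; for DATETIME the upper bound
-- would need to cover the whole last day.
EXPLAIN
SELECT *
    FROM t
    WHERE x = 123
      AND `date` BETWEEN '2024-01-01' AND '2024-01-31';
```

Within the index, all entries for one value of `x` are adjacent and sorted by `date`, so the whole WHERE clause can be satisfied by one contiguous scan of the index. One would expect EXPLAIN to name `idx_x_date` in its `key` column, although the optimizer's choice depends on the actual data.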

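In the second query, the Optimizer (at least in MySQL) will pick one of the two tables to start with, then reach into the other table to handle the rest of the WHERE. (Other engines use other techniques, but there is still a significant cost.)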
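
To see where the cost comes from, consider the most helpful indexes one could add to the two-table layout. This is a sketch that assumes the `t` and `Dates` tables from the second query, with `date_id` as the primary key of `Dates`; the index names are invented.

```sql
-- Lets the optimizer range-scan Dates by date to collect date_ids.
ALTER TABLE Dates ADD INDEX idx_date (`date`);

-- Lets the optimizer look up rows of t by (x, date_id).
ALTER TABLE t ADD INDEX idx_x_dateid (x, date_id);

-- Plan A: start with Dates.
--   Range-scan idx_date, then probe idx_x_dateid once per matching
--   date_id: many separate lookups instead of one contiguous scan.
-- Plan B: start with t.
--   Fetch every row with the given x, whatever its date, then look up
--   each row's date_id in Dates and discard rows outside the range.
EXPLAIN
SELECT *
    FROM t
    JOIN Dates AS d  ON t.date_id = d.date_id
    WHERE t.x = 123
      AND d.`date` BETWEEN '2024-01-01' AND '2024-01-31';
```

Because an index belongs to a single table, no index can combine the equality test on `t.x` with the range test on `d.date`. Whichever table comes first, the other half of the WHERE clause is applied row by row.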

DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamation that this applies to any "continuous" datatype (ints, dates, floats).

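Even if you don't have a range test, there may be no performance benefit from the secondary table. I have often seen a 3-byte DATE replaced by a 4-byte INT, thereby making the main table bigger! A "composite" index almost always leads to a more efficient query with the single-table approach.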
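
Rather than guessing at the space savings, one can load both layouts with the same data and compare their sizes. A minimal sketch for MySQL, where `t_single`, `t_split` and `Dates` are hypothetical names for the single-table version, the surrogate-id version, and its lookup table:

```sql
-- Approximate on-disk size of each table in the current database.
-- For InnoDB these figures are estimates, not exact byte counts.
SELECT TABLE_NAME,
       DATA_LENGTH,
       INDEX_LENGTH,
       DATA_LENGTH + INDEX_LENGTH AS total_bytes
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME IN ('t_single', 't_split', 'Dates');
```

For a fair comparison, the size of `Dates` and its indexes has to be added to that of `t_split`.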