How do I not normalize continuous data (INTs, FLOATs, DATETIMEs, ...)?
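As I understand it (please correct me if I am wrong), "normalization" is the process of removing redundant data from a database design.
However, while trying to learn about database optimization and performance tuning, I came across Mr. Rick James recommending against normalizing continuous values such as INTs, FLOATs, DATETIMEs, and so on: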
"Normalize, but don't over-normalize." In particular, do not normalize
datetimes or floats or other "continuous" values.
Sure purists say normalize time. That is a big mistake. Generally,
"continuous" values should not be normalized because you generally
want to do range queries on them. If it is normalized, performance
will be orders of magnitude worse.
Normalization has several purposes; they don't really apply here:
Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings
To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevant here; each
time was independently assigned, and you are not a Time Lord.
In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still
sucks.
Do not normalize "continuous" values -- dates, floats, etc --
especially if you will do range queries.
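I have tried to understand this point, but I cannot. Could someone explain it to me and give a worst-case example in which applying this rule improves performance?
Note: I could have asked him in a comment or elsewhere, but I wanted to document and highlight this point on its own, because I believe it is a very important consideration that affects the performance of almost my entire database.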
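The comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?
Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this can cause serious performance problems when a date range is used. Contrast these: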
-- single table
SELECT ...
FROM t
WHERE x = ...
AND date BETWEEN ... AND ...; -- `date` is of datatype DATE/DATETIME/etc
-- extra table
SELECT ...
FROM t
JOIN Dates AS d ON t.date_id = d.date_id
WHERE t.x = ...
AND d.date BETWEEN ... AND ...; -- Range test is now in the other table
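Moving the range test into a JOINed table causes a slowdown.
The first query can be optimized with INDEX(x, date).
In the second query, the Optimizer (at least in MySQL) will pick one of the two tables to start with, then reach into the other table to handle the rest of the WHERE. (Other engines use other techniques, but there is still a significant cost.)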
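The contrast between the two layouts can be tried out with a small, hedged sketch. It uses Python's built-in sqlite3 module rather than MySQL, so the plans it prints come from SQLite's planner and only illustrate the shape of the problem. The table `t_norm`, the index names, the function `compare_plans` and the sample data are all assumptions made up for this illustration.

```python
import sqlite3
from datetime import date, timedelta


def compare_plans():
    """Build both layouts in an in-memory SQLite database and compare them."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Layout 1: the date lives in the main table, covered by INDEX(x, date).
    cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER, date TEXT)")
    cur.execute("CREATE INDEX idx_t_x_date ON t (x, date)")

    # Layout 2 (hypothetical names): the date is replaced by a surrogate id.
    cur.execute("CREATE TABLE Dates (date_id INTEGER PRIMARY KEY, date TEXT UNIQUE)")
    cur.execute("CREATE TABLE t_norm (id INTEGER PRIMARY KEY, x INTEGER, date_id INTEGER)")
    cur.execute("CREATE INDEX idx_tn_x_date_id ON t_norm (x, date_id)")

    # 365 consecutive days starting 2020-01-01, stored as ISO-8601 strings.
    start = date(2020, 1, 1)
    days = [(i + 1, (start + timedelta(days=i)).isoformat()) for i in range(365)]
    cur.executemany("INSERT INTO Dates VALUES (?, ?)", days)

    # Ten values of x per day, loaded identically into both main tables.
    for date_id, iso in days:
        for x in range(10):
            cur.execute("INSERT INTO t (x, date) VALUES (?, ?)", (x, iso))
            cur.execute("INSERT INTO t_norm (x, date_id) VALUES (?, ?)", (x, date_id))

    params = (3, "2020-03-01", "2020-03-31")
    single_sql = "SELECT COUNT(*) FROM t WHERE x = ? AND date BETWEEN ? AND ?"
    joined_sql = (
        "SELECT COUNT(*) FROM t_norm AS tn "
        "JOIN Dates AS d ON tn.date_id = d.date_id "
        "WHERE tn.x = ? AND d.date BETWEEN ? AND ?"
    )

    result = {
        "single_count": cur.execute(single_sql, params).fetchone()[0],
        "joined_count": cur.execute(joined_sql, params).fetchone()[0],
        # The fourth column of EXPLAIN QUERY PLAN holds the readable detail.
        "single_plan": [r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + single_sql, params)],
        "joined_plan": [r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + joined_sql, params)],
    }
    con.close()
    return result


if __name__ == "__main__":
    for key, value in compare_plans().items():
        print(key, value)
```

Both queries return the same count, but the single-table plan is one index search that handles the equality on `x` and the range on `date` together, while the joined plan needs a step per table. On MySQL, running EXPLAIN against the real tables would be the way to check this.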
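DATE is one of several datatypes where you are likely to have a "range" test. Hence my claim that this applies to any "continuous" datatype (ints, dates, floats).
Even if you don't have a range test, there may be no performance benefit from the secondary table. I have often seen a 3-byte DATE replaced by a 4-byte INT, thereby making the main table larger! A "composite" index almost always leads to a more efficient query with the single-table approach.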
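The space argument can be checked with simple arithmetic. The sketch below is a rough estimate only: it counts the bytes of the one column using the per-value sizes quoted in this thread (DATE 3, MEDIUMINT 3, TIMESTAMP 4, INT 4) and ignores row overhead, indexes and the extra lookup table. The function names and the row count of 10 million are assumptions chosen for illustration.

```python
# Per-value storage sizes in bytes, as quoted in the discussion above.
MYSQL_BYTES = {"DATE": 3, "MEDIUMINT": 3, "TIMESTAMP": 4, "INT": 4}


def column_bytes(datatype: str, rows: int) -> int:
    """Bytes used by one column of the given datatype across all rows."""
    return MYSQL_BYTES[datatype] * rows


def surrogate_overhead(native: str, surrogate: str, rows: int) -> int:
    """Extra bytes in the main table when `native` is replaced by `surrogate`.

    A positive result means the main table grew; a negative one means it shrank.
    """
    return column_bytes(surrogate, rows) - column_bytes(native, rows)


if __name__ == "__main__":
    rows = 10_000_000  # hypothetical table size
    print(surrogate_overhead("DATE", "INT", rows))             # main table grows
    print(surrogate_overhead("TIMESTAMP", "MEDIUMINT", rows))  # main table shrinks
```

Under these assumptions, replacing DATE with an INT surrogate adds one byte per row to the main table, and replacing TIMESTAMP with a MEDIUMINT saves one byte per row before counting the lookup table that must also be stored.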