Normalize a table where there is a need to reference subsets of a column in another table and those subsets must be unique
How can I normalize this relationship (i.e., make it conform to 1NF, 2NF, and 3NF)?
CREATE TABLE IF NOT EXISTS series (
    series_id     SERIAL PRIMARY KEY,
    dimension_ids INT[] UNIQUE,
    dataset_id    INT REFERENCES dataset(dataset_id) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS dimension (
    dimension_id SERIAL PRIMARY KEY,
    dim          VARCHAR(50),
    val          VARCHAR(50),
    dataset_id   INT REFERENCES dataset(dataset_id) ON DELETE CASCADE,
    UNIQUE (dim, val, dataset_id)
);
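Here, a subset of dimension_id values uniquely identifies a record in the series table.
Edit
To provide more information, the data I want to store comes from an XML structure like the following: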
<?xml version="1.0" encoding="utf-8"?>
<message:StructureSpecificData >
  <message:Header>
    <message:ID>IREF757740</message:ID>
    <message:Test>false</message:Test>
    <message:Prepared>2020-04-09T14:55:23</message:Prepared>
  </message:Header>
  <message:DataSet ss:dataScope="DataStructure" ss:structureRef="CPI" xsi:type="ns1:DataSetType">
    <Series FREQ="M" GEOG_AREA="WC" UNIT="IDX">
      <Obs OBS_STATUS="A" OBS_VALUE="75.5" TIME_PERIOD="31-Jan-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="75.8" TIME_PERIOD="29-Feb-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="77" TIME_PERIOD="31-Mar-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="77.5" TIME_PERIOD="30-Apr-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="78" TIME_PERIOD="31-May-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="78.8" TIME_PERIOD="30-Jun-2008"/>
    </Series>
    <Series FREQ="M" GEOG_AREA="NC" UNIT="IDX">
      <Obs OBS_STATUS="A" OBS_VALUE="75.5" TIME_PERIOD="31-Jan-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="75.8" TIME_PERIOD="29-Feb-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="77" TIME_PERIOD="31-Mar-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="77.5" TIME_PERIOD="30-Apr-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="78" TIME_PERIOD="31-May-2008"/>
      <Obs OBS_STATUS="A" OBS_VALUE="78.8" TIME_PERIOD="30-Jun-2008"/>
    </Series>
  </message:DataSet>
</message:StructureSpecificData>
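There is a dataset containing series (0...n), each of which contains observations (0...n). The series are uniquely identified by their XML attributes, which I call dimensions in my data model. In my example I have two series, distinguished by the geographic area they cover. Any series can have an arbitrary number of dimensions. A series should be queryable by its dimensions, and the dimensions will also be queried by series_id. The obvious solution is a bridge table: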
CREATE TABLE series_dimension (
    series_id    INT REFERENCES series(series_id) ON DELETE CASCADE,
    dimension_id INT REFERENCES dimension(dimension_id)
);
However, this solution allows the following situation:
|--------------------------|
|     series_dimension     |
|-----------|--------------|
| series_id | dimension_id |
|-----------|--------------|
|     1     |      1       |
|     1     |      2       |
|     1     |      3       |
|     1     |      4       |
|     2     |      1       |
|     2     |      2       |
|     2     |      3       |
|     2     |      4       |
|-----------|--------------|
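That is, two different series have the same dimensions, so if I query for a series given a set of dimensions, I cannot tell whether I am looking for series_id=1 or series_id=2, which is unacceptable. So, in this case, do I have to choose between referential integrity and the uniqueness property I just described?
My conclusion is that normalizing such a relation (one where a column refers to attributes whose number is not known in advance) leads to a many-to-many or one-to-many relationship, which rules out a unique mapping.
Conversely, for a relation whose column refers to attributes whose number is not known in advance, the way to make the relationship one-to-one/unique is to group those attributes into unique subsets, which violates 1NF.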
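The ambiguity can be reproduced with a small query. This is only a sketch: it inlines the sample rows in a CTE instead of touching a real table, and the exact-match test (a row count combined with FILTER, available in PostgreSQL 9.4 and later) is one possible way to ask for "the series that has exactly these dimensions", not something taken from the original schema.

```sql
-- Sample rows copied from the series_dimension table shown above.
WITH series_dimension (series_id, dimension_id) AS (
    VALUES (1, 1), (1, 2), (1, 3), (1, 4),
           (2, 1), (2, 2), (2, 3), (2, 4)
)
SELECT series_id
FROM series_dimension
GROUP BY series_id
-- keep only series whose dimension set is exactly {1, 2, 3, 4}
HAVING count(*) = 4
   AND count(*) FILTER (WHERE dimension_id IN (1, 2, 3, 4)) = 4
ORDER BY series_id;
-- With the rows above, both series_id 1 and series_id 2 qualify.
```

Nothing in the bridge table itself prevents the second series from being inserted, which is the gap the question is about.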
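- There is only one way to specify a UNIQUE constraint, and that is on columns
- My example requires each series_id to reference a variable number of columns/dimensions
- Therefore I stack the columns into rows, with the result that the UNIQUE construct is no longer available
- In the solution where each series_id is related to an array specifying a subset of rows, I can declare that array column UNIQUE
- This violates 1NF, so this relationship cannot be normalized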
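Given your expectation of about 20 dimensions, the example is limited to 60. It does require a controlled process to define each set of dimensions (series).
Reasoning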
-- DIM is a valid numeric identifier for a dimension.
--
valid_dim {DIM}
PK {DIM}
CHECK ((DIM = 1) OR ((DIM > 1) AND (mod(DIM,2) = 0)))
-- data sample
(DIM)
---------
(2^0)
, (2^1)
, (2^2)
, ...
, (2^58)
, (2^59)
-- Dimension DIM, named DIM_NAME exists.
--
dimension {DIM, DIM_NAME}
PK {DIM}
AK {DIM_NAME}
FK {DIM} REFERENCES valid_dim {DIM}
-- data sample
(DIM, DIM_NAME)
---------------
(2^0, 'FREQ')
, (2^1, 'GEOG_AREA')
, (2^2, 'UNIT')
, ...
, (2^58, 'AGE_GROUP')
, (2^59, 'HAIR_COLOR')
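Loading series and ser_dim can be done from a function, an application, or anywhere else. However, it should be a controlled process.
SER is unique for a given set of dimensions.
Note that | is the bitwise OR operator.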
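As a quick illustration of why SER is unique per set of dimensions, the sketch below builds a few keys by hand. The shift expression `1::bigint << n` is my own shorthand for the `2^n` values used in this answer (an assumption for readability, since it yields the same bit), and the column aliases are hypothetical.

```sql
-- Each dimension owns one bit; a series key is the OR of its dimensions' bits.
SELECT (1::bigint << 0) | (1::bigint << 1) | (1::bigint << 2) AS f_g_u  -- 7
     , (1::bigint << 2) | (1::bigint << 1) | (1::bigint << 0) AS u_g_f  -- 7 again
     , (1::bigint << 1) | (1::bigint << 58)                   AS g_a;
```

The same dimensions combined in any order give the same number, and two different sets always differ in at least one bit, so a plain key on SER is enough to make each set unique.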
-- Series SER, named SER_NAME exists.
--
series {SER, SER_NAME}
PK {SER}
AK {SER_NAME}
-- data sample
(SER, SER_NAME)
--------------------------------
((2^0 | 2^1 | 2^2) , 'F-G-U')
, ((2^1 | 2^58) , 'G-A' )
, ((2^0 | 2^58 | 2^59), 'F-A-H')
-- Series SER has dimension DIM.
--
ser_dim {SER, DIM}
PK {SER, DIM}
FK1 {SER} REFERENCES series {SER}
FK2 {DIM} REFERENCES dimension {DIM}
CHECK ((DIM & SER) = DIM)
-- data sample
(SER, DIM)
--------------------------------
((2^0 | 2^1 | 2^2) , 2^0)
, ((2^0 | 2^1 | 2^2) , 2^1)
, ((2^0 | 2^1 | 2^2) , 2^2)
, ((2^1 | 2^58) , 2^1 )
, ((2^1 | 2^58) , 2^58)
, ((2^0 | 2^58 | 2^59), 2^0)
, ((2^0 | 2^58 | 2^59), 2^58)
, ((2^0 | 2^58 | 2^59), 2^59)
Note:
All attributes (columns) NOT NULL
PK = Primary Key
AK = Alternate Key (Unique)
FK = Foreign Key
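One caveat on the CHECK for valid_dim: as written it accepts 1 and every even number, so a value such as 6 (which is 2 | 4) would pass even though it is not a single bit. If you would rather have the database reject such values, a tighter test is possible. The following is a sketch of that idea, not part of the original design; the column aliases are hypothetical.

```sql
-- A positive value is a power of two exactly when clearing its
-- lowest set bit, n & (n - 1), leaves zero.
SELECT n
     , (n = 1 OR (n > 1 AND mod(n, 2) = 0)) AS passes_original_check
     , (n > 0 AND (n & (n - 1)) = 0)        AS is_power_of_two
FROM (VALUES (1::bigint), (2), (4), (6), (12), (1::bigint << 59)) AS t(n)
ORDER BY n;
-- 6 and 12 pass the original check but are not powers of two.

-- As a constraint it could read:
--   CHECK (DIM > 0 AND (DIM & (DIM - 1)) = 0)
```

Whether this is worth adding depends on how tightly the loading of valid_dim is controlled elsewhere.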
PostgreSQL
-- DIM is a valid numeric identifier
-- for a dimension.
--
CREATE TABLE valid_dim (
DIM bigint NOT NULL
, CONSTRAINT pk_valid_dim PRIMARY KEY (DIM)
, CONSTRAINT chk_valid_dim
CHECK ( (DIM = 1)
OR ( (DIM > 1)
AND (mod(DIM, 2) = 0) )
)
);
-- define some of valid DIMs
INSERT INTO valid_dim (DIM)
VALUES
((2^ 0)::bigint)
, ((2^ 1)::bigint)
, ((2^ 2)::bigint)
-- fill this gap
, ((2^58)::bigint)
, ((2^59)::bigint) ;
-- Dimension DIM, named DIM_NAME exists.
--
CREATE TABLE dimension (
DIM bigint NOT NULL
, DIM_NAME text NOT NULL
, CONSTRAINT pk_dim PRIMARY KEY (DIM)
, CONSTRAINT ak_dim UNIQUE (DIM_NAME)
, CONSTRAINT
fk_dim FOREIGN KEY (DIM)
REFERENCES valid_dim (DIM)
);
-- define few dimensions
INSERT INTO dimension (DIM, DIM_NAME)
VALUES
((2^ 0)::bigint, 'FREQ')
, ((2^ 1)::bigint, 'GEOG_AREA')
, ((2^ 2)::bigint, 'UNIT')
, ((2^58)::bigint, 'AGE_GROUP')
, ((2^59)::bigint, 'HAIR_COLOR') ;
-- Series SER, named SER_NAME exists.
--
CREATE TABLE series (
SER bigint NOT NULL
, SER_NAME text NOT NULL
, CONSTRAINT pk_series PRIMARY KEY (SER)
, CONSTRAINT ak_series UNIQUE (SER_NAME)
);
-- define three series
INSERT INTO series (SER, SER_NAME)
SELECT bit_or(DIM) as SER, 'F-G-U' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT')
UNION
SELECT bit_or(DIM) as SER, 'G-A' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('GEOG_AREA', 'AGE_GROUP')
UNION
SELECT bit_or(DIM) as SER, 'F-A-H' as SER_NAME
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'AGE_GROUP', 'HAIR_COLOR') ;
-- Series SER has dimension DIM.
--
CREATE TABLE ser_dim (
SER bigint NOT NULL
, DIM bigint NOT NULL
, CONSTRAINT pk_ser_dim PRIMARY KEY (SER, DIM)
, CONSTRAINT
fk1_ser_dim FOREIGN KEY (SER)
REFERENCES series (SER)
, CONSTRAINT
fk2_ser_dim FOREIGN KEY (DIM)
REFERENCES dimension (DIM)
, CONSTRAINT
chk_ser_dim CHECK ((DIM & SER) = DIM)
);
-- populate ser_dim
INSERT INTO ser_dim (SER, DIM)
SELECT SER, DIM
FROM series
JOIN dimension ON true
WHERE (DIM & SER) = DIM ;
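To see how this design closes the gap the bridge table left open, here is a minimal sketch. `series_demo` is a hypothetical temporary table that mirrors `series`, used so the example does not disturb the real tables, and `1::bigint << n` is assumed shorthand for `(2^n)::bigint`.

```sql
CREATE TEMP TABLE series_demo (
    SER      bigint NOT NULL
  , SER_NAME text   NOT NULL
  , CONSTRAINT pk_series_demo PRIMARY KEY (SER)
  , CONSTRAINT ak_series_demo UNIQUE (SER_NAME)
);

-- FREQ | GEOG_AREA | UNIT
INSERT INTO series_demo (SER, SER_NAME)
VALUES ((1::bigint << 0) | (1::bigint << 1) | (1::bigint << 2), 'F-G-U');

-- Same three dimensions under a different name and in a different order:
-- the computed SER is identical, so this statement is expected to be
-- rejected with a unique violation on pk_series_demo.
INSERT INTO series_demo (SER, SER_NAME)
VALUES ((1::bigint << 2) | (1::bigint << 0) | (1::bigint << 1), 'U-F-G');
```

Because the set of dimensions is encoded in the key itself, an ordinary primary key is what enforces "one series per dimension set".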
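Another option is to use a (materialized) view for ser_dim. It depends on the rest of the model: if a FK to {SER, DIM} is needed, keep the table; otherwise a view would be better.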
-- An option, instead of the table.
--
CREATE VIEW ser_dim
AS
SELECT SER, DIM
FROM series
JOIN dimension ON true
WHERE (DIM & SER) = DIM ;
Test
-- Show already defined series
-- and their dimensions.
SELECT SER_NAME, DIM_NAME
FROM ser_dim
JOIN series USING (SER)
JOIN dimension USING (DIM)
ORDER BY SER_NAME, DIM_NAME ;
-- Get SER for a set of dimensions;
-- use this when defining a series.
SELECT bit_or(DIM) AS SER
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT') ;
-- Find already defined series,
-- given a set of dimensions.
SELECT x.SER
FROM (
SELECT bit_or(DIM) AS SER
FROM dimension
WHERE DIM_NAME IN ('FREQ', 'GEOG_AREA', 'UNIT')
) AS x
WHERE EXISTS
(SELECT 1 FROM series AS s WHERE s.SER = x.SER) ;
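The lookup also works in the other direction: given only a SER value, the dimensions can be recovered with the same bit test that ser_dim uses. The sketch below is self-contained under one assumption: it inlines the five sample dimensions in a CTE named `dimension` (shadowing the real table for the duration of the query) and uses the literal 7, the key of the 'F-G-U' series.

```sql
WITH dimension (DIM, DIM_NAME) AS (
    VALUES (1::bigint << 0,  'FREQ')
         , (1::bigint << 1,  'GEOG_AREA')
         , (1::bigint << 2,  'UNIT')
         , (1::bigint << 58, 'AGE_GROUP')
         , (1::bigint << 59, 'HAIR_COLOR')
)
SELECT DIM_NAME
FROM dimension
WHERE (DIM & 7::bigint) = DIM   -- bit of this dimension is set in SER = 7
ORDER BY DIM;
-- returns FREQ, GEOG_AREA, UNIT
```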
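Summary
Unfortunately, mainstream SQL implementations do not support assertions, that is, database-wide constraints. The SQL standard actually defines them, but no luck so far. Hence, not every business constraint can be expressed elegantly in SQL; some creativity and compromise are usually required.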