How to replicate a SAS merge
I have two tables, t1 and t2:
t1
person | visit | code1 | type1
1 1 50 50
1 1 50 50
1 2 75 50
t2
person | visit | code2 | type2
1 1 50 50
1 1 50 50
1 1 50 50
When SAS runs the following code:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
it produces the following dataset:
person | visit | code1 | type1 | code2 | type2
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 2 75 50
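The behaviour shown above (pair rows by position within each BY group, and keep reusing the last row of whichever table runs out first) can be sketched in plain Python. This is only an illustration of the match-merge semantics as described in this thread, not SAS itself: the function name sas_merge is made up, the inputs are assumed to be lists of dicts already sorted by the BY columns, and the case where both tables share a non-BY column is not handled.

```python
from itertools import groupby


def sas_merge(left, right, by):
    """Emulate MERGE left right; BY ...; for lists of dicts sorted by `by`."""
    def key(row):
        return tuple(row[k] for k in by)

    # Column order: BY variables first, then the remaining columns of each table.
    columns = list(by)
    for table in (left, right):
        for name in (table[0] if table else {}):
            if name not in columns:
                columns.append(name)

    left_groups = {k: list(g) for k, g in groupby(left, key)}
    right_groups = {k: list(g) for k, g in groupby(right, key)}

    merged = []
    for k in sorted(set(left_groups) | set(right_groups)):
        lrows = left_groups.get(k, [])
        rrows = right_groups.get(k, [])
        for i in range(max(len(lrows), len(rrows))):
            row = dict.fromkeys(columns)  # every variable starts out missing (None)
            row.update(zip(by, k))
            # When one table runs out of rows within the BY group,
            # the values from its last row are retained.
            if lrows:
                row.update(lrows[min(i, len(lrows) - 1)])
            if rrows:
                row.update(rrows[min(i, len(rrows) - 1)])
            merged.append(row)
    return merged


t1 = [
    {"person": 1, "visit": 1, "code1": 50, "type1": 50},
    {"person": 1, "visit": 1, "code1": 50, "type1": 50},
    {"person": 1, "visit": 2, "code1": 75, "type1": 50},
]
t2 = [
    {"person": 1, "visit": 1, "code2": 50, "type2": 50},
    {"person": 1, "visit": 1, "code2": 50, "type2": 50},
    {"person": 1, "visit": 1, "code2": 50, "type2": 50},
]

t3 = sas_merge(t1, t2, by=("person", "visit"))
for row in t3:
    print(row)
```

The result has four rows: three for person 1, visit 1 (the second t1 row is reused for the third t2 row) and one for visit 2 with code2 and type2 missing.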
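I want to replicate this process in SQL, and my idea was to use a full outer join. This works unless there are duplicate rows. When there are duplicate rows, as in the example above, the full outer join produces the following table: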
person | visit | code1 | type1 | code2 | type2
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 2 75 50
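The row count above follows from how a join on the keys alone behaves: every t1 row in a BY group is paired with every t2 row in that group, so the (1, 1) group yields 2 x 3 = 6 rows. A minimal sketch using Python's sqlite3 module (not the database from the question) reproduces this. A left join is used here as an assumption-laden substitute for the full outer join, because older SQLite builds lack full outer join; for this data the two give the same rows, since every t2 key also exists in t1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (person int, visit int, code1 int, type1 int);
    create table t2 (person int, visit int, code2 int, type2 int);
    insert into t1 values (1,1,50,50), (1,1,50,50), (1,2,75,50);
    insert into t2 values (1,1,50,50), (1,1,50,50), (1,1,50,50);
""")

# Join on the BY keys only: rows multiply within each key group.
rows = conn.execute("""
    select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
    from t1 left join t2
      on t1.person = t2.person and t1.visit = t2.visit
    order by 1, 2
""").fetchall()

for row in rows:
    print(row)
```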
I would like to know how to make the SQL table match the SAS table.
You can replicate a SAS merge by adding row_number() to each table:
select t1.*, t2.*
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
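Notes:
- ?? stands for the column(s) to use for ordering. SAS datasets have an intrinsic order. SQL tables do not, so you need to specify the ordering.
- You should list the columns explicitly (instead of using t1.*, t2.* in the outer query). I think SAS includes person and visit only once in the resulting dataset.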
EDIT:
Note: The above produces separate values for the key columns. This is easy enough to fix:
select coalesce(t1.person, t2.person) as person,
coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
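To see what the row_number() pairing produces on the question's data, here is a small sketch run through Python's sqlite3 module rather than the asker's database. It rests on several assumptions: an SQLite build with window functions (3.25 or later), rowid used as a stand-in for the ?? ordering column, and the full outer join emulated by a left join plus a union all of the unmatched t2 rows. Each branch takes the key columns from the side that is guaranteed to be non-null, which plays the role of coalesce().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (person int, visit int, code1 int, type1 int);
    create table t2 (person int, visit int, code2 int, type2 int);
    insert into t1 values (1,1,50,50), (1,1,50,50), (1,2,75,50);
    insert into t2 values (1,1,50,50), (1,1,50,50), (1,1,50,50);
""")

PAIR_BY_SEQNUM = """
with a as (select *, row_number() over (partition by person, visit
                                        order by rowid) as seqnum
           from t1),
     b as (select *, row_number() over (partition by person, visit
                                        order by rowid) as seqnum
           from t2)
-- rows of t1, paired with the t2 row at the same position (if any)
select a.person, a.visit, a.seqnum, a.code1, a.type1, b.code2, b.type2
from a left join b
  on a.person = b.person and a.visit = b.visit and a.seqnum = b.seqnum
union all
-- rows of t2 that found no partner in t1
select b.person, b.visit, b.seqnum, null, null, b.code2, b.type2
from b
where not exists (select 1 from a
                  where a.person = b.person and a.visit = b.visit
                    and a.seqnum = b.seqnum)
order by 1, 2, 3
"""

paired = conn.execute(PAIR_BY_SEQNUM).fetchall()
for row in paired:
    print(row)
```

The key columns now appear once, but the third row of the (1, 1) group comes back with code1 and type1 as NULL, where the SAS merge would have carried forward the last t1 row.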
The coalesce() calls fix the column problem. You can fix the copying problem by using first_value()/last_value() or by using a more complicated join condition:
select coalesce(t1.person, t2.person) as person,
coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
(t1.seqnum = t2.seqnum or
(t1.cnt > t2.cnt and t1.seqnum > t2.seqnum and t2.seqnum = t2.cnt) or
(t2.cnt > t1.cnt and t2.seqnum > t1.seqnum and t1.seqnum = t1.cnt));
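This implements the "keep the last row" logic in a single join. Probably for performance reasons, you would want to put this into separate left joins on the original logic.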
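The "keep the last row" join condition can be tried out with Python's sqlite3 module. This is a sketch under assumptions, not the query as it would run on the asker's database: it needs SQLite 3.25 or later for window functions, it orders by rowid in place of ??, and it emulates the full outer join with a left join plus the t2 rows whose person/visit never occurs in t1 (with this condition every row of a shared group already finds a partner in the left join). The data uses distinct values per row so that the reused row is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t1 (person int, visit int, code1 int, type1 int);
    create table t2 (person int, visit int, code2 int, type2 int);
    insert into t1 values (1,1,1,1), (1,1,2,2), (1,2,1,3);
    insert into t2 values (1,1,1,1), (1,1,2,2), (1,1,3,3);
""")

KEEP_LAST_ROW = """
with a as (select *, count(*) over (partition by person, visit) as cnt,
                  row_number() over (partition by person, visit
                                     order by rowid) as seqnum
           from t1),
     b as (select *, count(*) over (partition by person, visit) as cnt,
                  row_number() over (partition by person, visit
                                     order by rowid) as seqnum
           from t2)
select a.person, a.visit, a.code1, a.type1, b.code2, b.type2
from a left join b
  on a.person = b.person and a.visit = b.visit and
     (a.seqnum = b.seqnum or
      -- t1 has extra rows: pair them with the last row of t2
      (a.cnt > b.cnt and a.seqnum > b.seqnum and b.seqnum = b.cnt) or
      -- t2 has extra rows: pair them with the last row of t1
      (b.cnt > a.cnt and b.seqnum > a.seqnum and a.seqnum = a.cnt))
union all
-- person/visit groups that exist only in t2
select b.person, b.visit, null, null, b.code2, b.type2
from b
where not exists (select 1 from a
                  where a.person = b.person and a.visit = b.visit)
order by 1, 2, 5, 3
"""

merged = conn.execute(KEEP_LAST_ROW).fetchall()
for row in merged:
    print(row)
```

The third row of the (1, 1) group now repeats code1 = 2, type1 = 2 from the last t1 row, and the (1, 2) row keeps NULLs on the t2 side.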
Gordon's answer is close, but it misses one point. Here is its output:
person visit code1 type1 seqnum person visit code2 type2 seqnum
1 1 1 1 1 1 1 1 1 1
1 1 2 2 2 1 1 2 2 2
NULL NULL NULL NULL NULL 1 1 3 3 3
1 2 1 3 1 NULL NULL NULL NULL NULL
The NULLs in the third row are incorrect, while those in the fourth row are correct.
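As far as I know, there is no really good way to do this in SQL other than splitting things up into a few queries. I think there are five possibilities:
- Matching person/visit, matching seqnums
- Matching person/visit, left has more seqnums
- Matching person/visit, right has more seqnums
- Left has unmatched person/visit
- Right has unmatched person/visit
I think the last two might be workable into one query, but I think the second and third have to be separate queries. You can union everything together, of course.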
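As a rough way to see which of the five cases each output row falls into, the row counts per person/visit on each side are enough. The helper below is hypothetical (the name classify_rows and the label strings are made up for illustration) and works on plain lists of key tuples rather than on database tables.

```python
from collections import Counter


def classify_rows(left_keys, right_keys):
    """Label every output position of the merge with one of the five cases."""
    left_counts, right_counts = Counter(left_keys), Counter(right_keys)
    labels = []
    for key in sorted(set(left_counts) | set(right_counts)):
        n_left, n_right = left_counts[key], right_counts[key]
        for seqnum in range(1, max(n_left, n_right) + 1):
            if n_right == 0:
                case = "left has unmatched person/visit"
            elif n_left == 0:
                case = "right has unmatched person/visit"
            elif seqnum <= min(n_left, n_right):
                case = "matching seqnums"
            elif n_left > n_right:
                case = "left has more seqnums"
            else:
                case = "right has more seqnums"
            labels.append((key, seqnum, case))
    return labels


# Keys (person, visit) of the two example tables.
labels = classify_rows(
    left_keys=[(1, 1), (1, 1), (1, 2)],
    right_keys=[(1, 1), (1, 1), (1, 1)],
)
for label in labels:
    print(label)
```

For the example tables this gives two "matching seqnums" rows, one "right has more seqnums" row and one "left has unmatched person/visit" row, which are the three cases that occur in the question's data.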
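So here is an example, using some temporary tables that are a little better suited to seeing what is going on. Note that the third row now has code1 and type1 filled in, even though those are 'extra'. I have only added three of the five criteria (the three that occur in your initial example), but the other two are not too hard.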
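Note that this is an example of something that is far faster in SAS, because SAS has the concept of row-wise processing, i.e. it is able to step through the data one row at a time. SQL tends to take a lot longer on these with large tables, unless it is possible to partition things very neatly and have very good indexes. Even then, I have never seen a SQL DBA do anywhere near as well as SAS on some of these types of problems. That is something you will have to accept, of course. SQL has its own advantages, one of which is probably price...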
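Here is my example code. I am sure it is not terribly elegant; hopefully one of the SQL folks can improve it. It is written to work in SQL Server (using table variables); the same thing should work with some changes (to use temporary tables) in other variants, assuming they implement windowing. (SAS of course cannot do this particular thing, as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, modified with the additional bits at the end. Anyone who wants to improve this, please feel free to edit and/or copy any bit you wish to a new or existing answer.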
declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
declare @t2 table (person INT, visit INT, code2 INT, type2 INT);
insert into @t1 values (1,1,1,1)
insert into @t1 values (1,1,2,2)
insert into @t1 values (1,2,1,3)
insert into @t2 values (1,1,1,1)
insert into @t2 values (1,1,2,2)
insert into @t2 values (1,1,3,3)
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
row_number() over (partition by person, visit order by type1) as seqnum
from @t1
) t1 inner join
(select *,
row_number() over (partition by person, visit order by type2) as seqnum
from @t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum
union all
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (
(select person, visit, MAX(seqnum) as max_rownum from (
select person, visit,
row_number() over (partition by person, visit order by type1) as seqnum
from @t1) t1_f
group by person, visit
) t1_m inner join
(select *, row_number() over (partition by person, visit order by type1) as seqnum
from @t1
) t1
on t1.person=t1_m.person and t1.visit=t1_m.visit
and t1.seqnum=t1_m.max_rownum
inner join
(select *,
row_number() over (partition by person, visit order by type2) as seqnum
from @t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum < t2.seqnum
)
union all
select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
from @t1 t1 left join @t2 t2
on t2.person=t1.person and t2.visit=t1.visit
where t2.code2 is null