SAS: Fuzzy Joins
I have the following SQL query that I am running in SAS:
proc sql;
create table my_table as
select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3 and a.id1 = b.id1)
or a.id2 = b.id2;
quit;
My question: I am trying to replace "a.id1 = b.id1" with a fuzzy comparison, "a.id1 FUZZY EQUAL b.id1" (https://www.ibm.com/docs/en/psfa/7.2.1?topic=functions-fuzzy-string-search), and to run this as an "explicit pass-through" query (https://www.lexjansen.com/mwsug/2013/RF/MWSUG-2013-RF02.pdf):
proc sql;
connect to netezza(server = &abc database = &abc user =&abc password = &abc bunkload = yes);
create table my_table as
select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3 where le_dst(a.id1, b.id1) = 1 )
or a.id2 = b.id2;
quit;
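But I am new to SAS and do not know how to do this correctly (the tables are on Netezza).
Can someone show me how to do this? Are there other common "fuzzy join" functions that are well suited to this kind of problem?
Thanks!
Note 1: The tables look like this: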
> table_a
    id1 id2     date_1
1 123 A  11 2010-01-31
2 123BB  12 2010-01-31
3  12 5  14 2015-01-31
4 12--5  13 2018-01-31
> table_b
    id1 id2     date_2     date_3
1  0123 111 2009-01-31 2011-01-31
2  1233 112 2010-01-31 2010-01-31
3 125 .  14 2010-01-31 2020-01-31
4  125_ 113 2010-01-31 2020-01-31
Note 2: R code used to create these tables for this example (in my original problem, the dates are stored as "factor" variables in R):
table_a = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_a$id1 = as.factor(table_a$id1)
table_a$id2 = as.factor(table_a$id2)
table_a$date_1 = as.factor(table_a$date_1)
table_b = data.frame(id1 = c("0123", "1233", "125 .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_b$id1 = as.factor(table_b$id1)
table_b$id2 = as.factor(table_b$id2)
table_b$date_2 = as.factor(table_b$date_2)
table_b$date_3 = as.factor(table_b$date_3)
Push the SQL into the remote database.
proc sql;
connect to netezza .... ;
create table sastable as
select * from connection to netezza
(
select a.*, b.*
from table_a a
inner join table_b b
on (a.date_1 between b.date_2 and b.date_3)
and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)
)
;
quit;
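On the question of other fuzzy-matching functions: if the tables can be copied into SAS, the comparison could also be done on the SAS side instead of in Netezza. SAS has its own edit-distance functions, including COMPLEV (Levenshtein distance), COMPGED (generalized edit distance) and SPEDIS (spelling distance). The following is only a rough sketch, assuming copies of both tables already exist in the WORK library (work.table_a and work.table_b are assumed names) and that the date columns are stored in a comparable form, either SAS dates or ISO yyyy-mm-dd strings. The threshold of 1 and the output column names are illustrative choices, not something taken from the question.

```sas
/* Sketch only: assumes work.table_a and work.table_b are local copies */
/* of the Netezza tables. Nothing here is pushed to the database.      */
proc sql;
  create table my_table_local as
  select a.id1 as a_id1,   /* rename to avoid duplicate column names */
         a.id2 as a_id2,
         a.date_1,
         b.id1 as b_id1,
         b.id2 as b_id2,
         b.date_2,
         b.date_3
  from work.table_a as a
  inner join work.table_b as b
    on (a.date_1 between b.date_2 and b.date_3)
   /* COMPLEV returns the Levenshtein edit distance between two strings; */
   /* "<= 1" keeps exact matches (distance 0) as well as near matches.   */
   and (complev(strip(a.id1), strip(b.id1)) <= 1 or a.id2 = b.id2);
quit;
```

Because every row of one table is compared with every row of the other, this kind of join can be slow on large tables, which is one reason to prefer the pass-through version when the data already lives in Netezza.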
Page 258 of the SAS® 9.4 SQL Procedure User's Guide, Fourth Edition shows two forms of the CONNECT statement syntax:
CONNECT TO dbms-name
<(connect-statement-argument-1=value-1)>
<(database-connection-argument-1=value-1)>;
CONNECT USING libref;
Replace the italicized items with your actual values; items between <> are optional.
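As an illustration of the two forms, here is a minimal sketch. Every connection value is a placeholder, and the libref mynz, the alias nz and the output table names are made up for the example. The inner query is just a row count, to keep the focus on the connection syntax.

```sas
/* Form 1: CONNECT TO with the connection arguments given inline. */
/* All values are placeholders; substitute your own settings.     */
proc sql;
  connect to netezza as nz
    (server="your-server" database="your-db"
     user="your-user" password="your-password");
  create table sastable as
  select * from connection to nz
    ( select count(*) as n_rows from table_a );
  disconnect from nz;
quit;

/* Form 2: CONNECT USING reuses the connection details of a libref. */
/* The libref name mynz is hypothetical.                            */
libname mynz netezza server="your-server" database="your-db"
        user="your-user" password="your-password";

proc sql;
  connect using mynz as nz;
  create table sastable2 as
  select * from connection to nz
    ( select count(*) as n_rows from table_a );
  disconnect from nz;
quit;
```

The second form is convenient when a libref for the database is already defined, since the credentials then live in one place.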