Left join using HASH with missing values to be replaced with the latest available value in SAS

Question

I have to join the following two tables.

Table one (1 billion rows)

Data one;
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

Table two (32 million rows)

Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

I want the joined table to look like this:

snapshotdate sourcekey sourcesystemid value1  NPL
20200101      112         5            788    999
20200102      112         5            789    999
20200103      112         5            800    999
20200201      112         5            786    988
20200202      112         5            777    988
20200203      112         5            834    988
20200301      112         5            789    988
20200302      112         5            771    988
20200303      112         5            832    988
20200101      222         6            788    555
20200102      222         6            789    555
20200103      222         6            800    555
20200201      222         6            786    556
20200202      222         6            777    556
20200203      222         6            834    556
20200301      222         6            789    556
20200302      222         6            771    556
20200303      222         6            832    556

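When a period (year-month) is missing from table two, the gap must be filled with the latest available value. This is the code I currently have (it does not replace the missing values):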

Proc SQL;
Create Table want as
Select 
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;

Because large tables are involved, I would prefer a hash-based approach. I would like to use table two as the hash object.

Thanks in advance.

Answer 1

If I understand your question correctly, and you have enough memory to hold table two, you can do this:

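data want(drop=period rc);
   /* Load table two into a hash object once, on the first iteration */
   if _N_ = 1 then do;
      dcl hash h(dataset : "two");
      h.definekey("sourcekey", "sourcesystemid", "period");
      h.definedata("npl");
      h.definedone();
   end;

   set one;
   /* Never executes; it only adds the variables of two (period, npl) to the PDV */
   if 0 then set two;

   /* int(snapshotdate/100) turns yyyymmdd into yyyymm; a failed lookup leaves npl unchanged */
   rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;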
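
One caveat is worth spelling out. In the step above, npl is never reset between rows, because variables that come from a SET statement are retained, and a failed h.find leaves npl untouched. That is what carries 988 forward to the 202003 rows. It would also carry the last value of one key into the next key if that key's first period is missing from two. The following is only a sketch of one way to guard against that. It assumes that one is sorted by sourcekey and sourcesystemid, with snapshotdate ascending inside each group, as in the sample data; that ordering is an assumption, not something the question states for the full 1 billion rows.

```sas
data want(drop=period rc);
   /* Load table two into a hash object once, on the first iteration */
   if _N_ = 1 then do;
      dcl hash h(dataset : "two");
      h.definekey("sourcekey", "sourcesystemid", "period");
      h.definedata("npl");
      h.definedone();
   end;

   set one;
   /* Assumption: one is sorted by these keys, snapshotdate ascending within each key */
   by sourcekey sourcesystemid;
   if 0 then set two;

   /* New key group: discard the value carried over from the previous group */
   if first.sourcesystemid then call missing(npl);

   /* On a failed lookup npl keeps the last value found for this key */
   rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;
```

If one is grouped by key but not in sorted order, adding the NOTSORTED option to the BY statement should avoid a sort of the large table. With this version, a key whose earliest periods are missing from two gets a missing npl until its first match, instead of a value from a different key.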


hashtable

sas

left-join