Extracting unique values from all the datasets in a SAS library

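I need to extract all unique/distinct values of some common variables across all the datasets in a SAS library. I tried the code below, but is there a better way to get the result into a single dataset?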

%macro dslist(); 
proc sql noprint;
select  memname into :mylist separated by ' '
from dictionary.tables where libname= "VIEW" and upcase(memname) like "data_%"
;
quit;

%put &mylist;
  data _null_;
       datanum = countw("&mylist");
       call symput('Dataset', put(datanum, 10.));
  run;
%put #######&Dataset;

proc sql ;
%do i = 1 %to  &Dataset ;
  %let dataname=view.%scan(&mylist,&i,%str( ));
  create table %scan(&mylist,&i,%str( )) as 
   select distinct id,visit 
   from &dataname 
   order by id,visit
  ;
%end;
quit;
%mend;
%dslist;

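After this step I use proc append to stack all the datasets and then remove the duplicates.

Also, if anyone knows a hash method, that would be more efficient!

Thanks!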

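I wonder how your code actually handles the condition upcase(memname) like "data_%". Since upcase() returns uppercase text, a lowercase pattern can never match it.

Create fake data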
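To illustrate the point about the pattern, here is a minimal sketch of the same lookup with an uppercase pattern. It also escapes the underscore, because `_` in a LIKE pattern matches any single character. The escape character `^` is an arbitrary choice for this example, not something from the original code.

```sas
proc sql noprint;
  /* Compare the upcased member name against an uppercase pattern. */
  /* '^_' means a literal underscore; '%' still matches any suffix. */
  select memname into :mylist separated by ' '
  from dictionary.tables
  where libname = "VIEW"
    and upcase(memname) like 'DATA^_%' escape '^'
  ;
quit;

%put &mylist;
```

If no member matches, `mylist` is never created, so the `%put` would only echo the unresolved reference with a warning.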

libname view "/home/kermit/folder";

data view.data_A;
    call streaminit(123);
    array _{5} $ ('s', 't', 'a', 'c', 'k');

    do i=1 to 100000;
        id=rand("integer", 1, 1000);
        j=rand('integer', 1, dim(_));
        visit=_[j];
        output;
    end;
    drop i j _:;
run;

data view.data_B;
    call streaminit(123);
    array _{5} $ ('s', 't', 'a', 'c', 'k');

    do i=1 to 100000;
        id=rand("integer", 1, 1000);
        j=rand('integer', 1, dim(_));
        visit=_[j];
        output;
    end;
    drop i j _:;
run;

data view.data_C;
    call streaminit(123);
    array _{5} $ ('s', 't', 'a', 'c', 'k');

    do i=1 to 100000;
        id=rand("integer", 1, 1000);
        j=rand('integer', 1, dim(_));
        visit=_[j];
        output;
    end;
    drop i j _:;
run;

Combine everything into one table

proc sql noprint;
select cats(libname,'.',memname,"(keep= id visit)") into :mylist separated by ' '
from dictionary.tables where libname="VIEW" and upcase(memname) like "DATA_%"
;
quit;

data have;
set &mylist.;
run;

Extract all unique values of id and visit

proc sort data=have out=want nodupkey; by id visit; run;
 NOTE: There were 300000 observations read from the data set WORK.HAVE.
 NOTE: 295000 observations with duplicate key values were deleted.
 NOTE: The data set WORK.WANT has 5000 observations and 2 variables.
 NOTE: PROCEDURE SORT a utilisé (Durée totale du traitement) :
       real time           0.08 seconds
       user cpu time       0.14 seconds
       system cpu time     0.02 seconds
       memory              23404.76k
       OS Memory           51740.00k
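Since the question also asks about a hash approach, here is a minimal sketch of one. It assumes the combined WORK.HAVE table from above and that the distinct id/visit pairs fit in memory; the output name `want_hash` is hypothetical. Whether it beats PROC SORT on a given system is not something this sketch demonstrates.

```sas
data _null_;
  /* Hash keyed on id and visit; ordered:'a' keeps keys in ascending order */
  declare hash h(ordered: 'a');
  h.defineKey('id', 'visit');
  h.defineData('id', 'visit');
  h.defineDone();

  /* Read every row once and store a key only the first time it is seen */
  do until (eof);
    set have end=eof;
    if h.check() ne 0 then rc = h.add();
  end;

  /* Write the distinct pairs to a dataset (hypothetical name) */
  rc = h.output(dataset: 'want_hash');
  stop;
run;
```

With the fake data above, `want_hash` should hold the same id/visit pairs as WORK.WANT.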

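If the number of datasets is small, you could just generate a single SQL statement that both selects and de-dups. But there is a limit on the number of tables a single SQL statement can reference, just as there is a limit on how many dataset names fit into the single macro variable that your current code generates.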
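For a handful of datasets, such a single statement could look like the sketch below. The three dataset names are written out by hand for illustration, and `id_visit_union` is a hypothetical output name. UNION without the ALL keyword removes duplicate rows, so no separate de-dup step is needed.

```sas
proc sql;
  /* UNION (without ALL) keeps only distinct id/visit rows */
  create table id_visit_union as
    select id, visit from view.data_A
    union
    select id, visit from view.data_B
    union
    select id, visit from view.data_C
    order by id, visit
  ;
quit;
```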

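So to make something more robust, you could use a data step view to combine the data and PROC SORT to de-dup it.

First get the list of datasets that have both the ID and VISIT variables and meet your other criteria.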

proc sql ;
create table dslist as
  select  catx('.',libname,nliteral(memname)) as dsname
  from dictionary.columns
  where libname= "VIEW"
    and memname like %upcase("data_%")
    and upcase(name) in ('ID' 'VISIT')
  group by 1
  having count(*)=2
;
quit;

Then use that list to define a data step view that combines just the ID and VISIT variables from all of those datasets.

filename code temp;
data _null_;
  set dslist end=eof;
  file code lrecl=72;
  if _n_=1 then put 'data id_visit_v / view=id_visit_v;' / '  set ' @;
  put dsname '(keep=id visit) ' @;
  if eof then put ';' / 'run;' ;
run;

%include code / source2;

Then use PROC SORT to get the distinct set of ID*VISIT combinations.

proc sort data=id_visit_v nodupkey out=id_visit ;
  by id visit;
run;

Clean up.

proc delete data=id_visit_v (memtype=view);
run;