SAS: removing duplicates by the reappearance of chunks of records
I have a dataset in which chunks of data reappear as groups of observations of the same length, for example:
data have;
input name $ identifier ;
cards;
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
adam 2
adam 3
adam 3
adam 7
/*remove*/
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
/*remove*/
adam 8
mary 1
mary 2
mary 3
mary 4
mary 5
mary 7
mary 6
adam 9
mary 1
mary 2
mary 3
;
I would like to remove the reappearing mary chunk marked by /*remove*/ and have the identifiers in order. The result should look like this:
mary 1
mary 2
mary 4
mary 5
mary 6
mary 7
adam 2
adam 3
adam 7
adam 8
mary 1
mary 2
mary 3
mary 4
mary 5
mary 6
mary 7
adam 9
mary 1
mary 2
mary 3
Thanks for your help! Someone suggested a hash table approach, but I suspect I may not have enough memory to run that code. Can this be done with data steps or proc sql?
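If the maximum number of records per group is small enough, you can build a string containing the list of identifiers in the group and use it as one of the keys in a HASH object, as follows.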
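data want ;
  /* First loop: read one block of consecutive same-name records and build its identifier list */
  do until (last.name);
    set have ;
    by name notsorted ;
    /* must hold the longest identifier list; $200 is an assumed size, adjust to your data */
    length taglist $200 ;
    taglist=catx('|',taglist,identifier);
  end;
  if _n_=1 then do;
    dcl hash h();
    h.defineKey('name','taglist');
    h.defineDone();
  end;
  /* add() returns nonzero when the key already exists, i.e. this block was seen before */
  found = 0 ne h.add();
  /* Second loop: re-read the same block and output it only on its first occurrence */
  do until (last.name);
    set have ;
    by name notsorted ;
    if not found then output;
  end;
  drop found taglist;
run;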
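Whether the hash approach is practical depends on how long the identifier lists get, so it may help to measure that first. The step below is only a sketch: it rebuilds the same list per block and writes the longest length to the log. The variable names maxlen and eof are mine, and $32767 is used simply because it is the maximum character length.

```sas
/* Sketch: find the longest identifier list before choosing the taglist length */
data _null_;
  length taglist $32767;   /* assumed upper bound: the maximum character length */
  retain maxlen 0;         /* hypothetical helper variable, kept across blocks */
  do until (last.name);
    set have end=eof;
    by name notsorted;
    taglist = catx('|', taglist, identifier);
  end;
  maxlen = max(maxlen, lengthn(taglist));
  if eof then put 'Longest taglist: ' maxlen;
run;
```

If the reported length is small, a length statement slightly larger than that value should be enough for the hash key.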
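If the keys are too large to fit in a hash object, you will need to make multiple passes. First find the groups. Then find the first occurrence of each type of group. Then generate the data for those groups.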
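data pass1 ;
  /* One output row per block: its sequence number, identifier list and row range in HAVE */
  group + 1;
  first_obs=row+1;
  do until (last.name);
    set have ;
    by name notsorted ;
    /* must hold the longest identifier list; $200 is an assumed size, adjust to your data */
    length taglist $200 ;
    taglist=catx('|',taglist,identifier);
    row+1;
  end;
  last_obs=row;
  output;
  keep group name taglist first_obs last_obs;
run;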
proc sql ;
create table pass2 as
select group,first_obs,last_obs
from pass1
group by name,taglist
having min(group)=group
order by group
;
quit;
data want;
set pass2;
do obs=first_obs to last_obs;
set have point=obs;
output;
end;
drop /*group*/ first_obs last_obs ;
run;
Results:
Obs group name identifier
1 1 mary 1
2 1 mary 2
3 1 mary 2
4 1 mary 4
5 1 mary 5
6 1 mary 7
7 1 mary 6
8 2 adam 2
9 2 adam 3
10 2 adam 3
11 2 adam 7
12 4 adam 8
13 5 mary 1
14 5 mary 2
15 5 mary 3
16 5 mary 4
17 5 mary 5
18 5 mary 7
19 5 mary 6
20 6 adam 9
21 7 mary 1
22 7 mary 2
23 7 mary 3
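The listing above still has the repeated identifiers within a block (mary 2 twice) and the original record order, while the expected output in the question shows each block deduplicated and sorted. Assuming the group variable is kept in want, as in the multi-pass version, one possible follow-up is a PROC SORT with NODUPKEY. The output name want_sorted is just a placeholder I chose.

```sas
/* Sketch: order identifiers within each kept block and drop repeats inside the block */
proc sort data=want out=want_sorted nodupkey;
  by group identifier;
run;
```

Because name is constant within a group, sorting by group and identifier keeps the blocks in their original sequence and only reorders the rows inside each block.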