Calculate maximum difference between grouped rows

I have the following data, where the members of each household are sorted by age (oldest to youngest):

data houses;             
input HouseID PersonID Age;       
datalines;              
1 1 25                    
1 2 20                   
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;

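For each household, I want to calculate the maximum age difference between consecutive members when they are ordered by age. For this example, that gives 5 (=25-20), 16 (=32-16) and 32 (=42-10) for households 1, 2 and 3 respectively.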

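I could do this with a large number of merges (i.e. extract person 1, merge with an extract of person 2, and so on), but since a household can have more than 20 members, I am looking for a more direct method.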

proc sort data=houses; by houseid personid age;run;

data _t1;
set houses;
diff = dif1(age) * (-1);
if personid = 1 then diff = .;
run;


proc sql;
create table want as 
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
quit;

proc sort data = houses out = house;
 by houseid descending age;
run;

data house;
set house;
by houseid;
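/* Call LAG1 on every row; it ignores BY groups, so the first row of each house is handled separately */
lag_age = lag1(age);
if first.houseid then age_diff = 0;
else age_diff = lag_age - age;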
run;

proc sql;
 select houseid,max(age_diff) as max_age_diff
 from house
 group by houseid;
quit;

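How it works:

First, the dataset is sorted by houseid and descending age. The second data step computes the difference between the age on the previous row (returned by LAG1) and the age on the current row (in the PDV). Then, using PROC SQL, we get the maximum age difference for each houseid.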
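One detail behind this explanation is that LAG1 knows nothing about BY groups: on the first row of a household it returns the last age of the previous household. The following is a minimal sketch of that behaviour, not part of the original answer. The dataset names `sorted` and `lag_demo` and the variable `is_first` are illustrative assumptions, and the sample data is repeated only so the block runs on its own.

```sas
/* Sample data copied from the question; @@ reads several records per line */
data houses;
    input HouseID PersonID Age @@;
    datalines;
1 1 25  1 2 20  2 1 32  2 2 16  2 3 14  2 4 12
3 1 44  3 2 42  3 3 10  3 4 5
;
run;

/* Sort a copy (hypothetical name) so the original order is left untouched */
proc sort data=houses out=sorted;
    by HouseID descending Age;
run;

data lag_demo;
    set sorted;
    by HouseID;
    /* LAG1 returns the Age from the previous row, whatever household it belongs to */
    lag_age = lag1(Age);
    /* 1 on the first row of each household, 0 otherwise */
    is_first = first.HouseID;
run;

proc print data=lag_demo noobs;
    var HouseID PersonID Age lag_age is_first;
run;
```

In the printed output, the first row of household 2 should show lag_age = 20, which is the youngest age in household 1. That is why the difference on rows where first.houseid is true has to be reset rather than computed.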

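Here is a two-pass solution. The first step is the same as in the two solutions above: sort by age within each household. In the second step, max_diff is tracked from row to row and the result is output at the last record of each HouseID. This needs only two passes over the data.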

/* Sort oldest first so that dif1(age)*-1 below is a non-negative gap */
proc sort data=houses; by houseid descending age; run;

data want;
set houses;
by houseID;

retain max_diff 0;

diff = dif1(age)*-1;

if first.HouseID then do;
    diff = .; max_diff=.;
end;

if diff>max_diff then max_diff=diff;
if last.houseID then output;

keep houseID max_diff;
run; 

Adding one more. This is a condensed version of Reeza's answer.

/* No need to sort by PersonID as age is the only concern */
proc sort data = houses;
    by HouseID Age;
run;
data want;
    set houses;
    by HouseID;
    /* Keep the diff when a new row is loaded */
    retain diff;
    /* Only replace the diff if it is larger than previous */
    diff = max(diff, abs(dif(Age)));
    /* Reset diff for each new house */
    if first.HouseID then diff = 0;
    /* Only output the final diff for each house */
    if last.HouseID;
    keep HouseID diff;
run;
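One edge case the sample data does not cover is a household with a single member. The sketch below reruns the step above on a small dataset that adds a made-up household 4 with one person (an assumption for illustration, not part of the question's data). The names `houses_edge` and `want_edge` are hypothetical.

```sas
/* Household 3 from the question plus an invented single-member household 4 */
data houses_edge;
    input HouseID PersonID Age;
    datalines;
3 1 44
3 2 42
3 3 10
4 1 70
;
run;

proc sort data=houses_edge;
    by HouseID Age;
run;

data want_edge;
    set houses_edge;
    by HouseID;
    retain diff;
    /* DIF is called on every row, then the result is discarded on the first row of a house */
    diff = max(diff, abs(dif(Age)));
    if first.HouseID then diff = 0;
    /* Keep only the last row of each house */
    if last.HouseID;
    keep HouseID diff;
run;

proc print data=want_edge noobs;
run;
```

Because the reset to 0 happens after the MAX call, the only row of household 4 is both first and last, so it is output with diff = 0, while household 3 still gives 32.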

Here is an example that uses FIRST. and LAST. and makes a single pass over the data (after sorting).

data houses;             
 input HouseID PersonID Age;       
 datalines;              
1 1 25                    
1 2 20                   
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;

Proc sort data=HOUSES;
 by houseid descending age ;
run;

Data WANT(keep=houseid max_diff);
 format houseid max_diff;
 retain max_diff age1 age2;
 Set HOUSES;

 by houseid descending age ;

 if first.houseid and last.houseid then do;
  max_diff=0;
  output;
 end;
 else if first.houseid then do;
  call missing(max_diff,age1,age2);
  age1=age;
 end;
 else if not(first.houseid or last.houseid) then do;
  age2=age;
  temp=age1-age2;
  if temp>max_diff then max_diff=temp;
  age1=age;  
 end;
 else if last.houseid then do;
  age2=age;
  temp=age1-age2;
  if temp>max_diff then max_diff=temp;
  output;
 end;
Run;