SAS delete and group by

Question

A simplified version of the data set I have is:

DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" "LMK" 16
3 "LMK" . 29
;
RUN;

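I am comparing match1 and match2. If match1 does not equal match2 anywhere within an ID, I want to delete all rows with that ID. So for this example data set, I would want to delete all of ID 2 (rows 3 and 4), because row 3 has no match between match1 and match2. All I have been able to figure out so far is how to delete the individual rows where they don't match, which isn't very helpful for this application. I think it might be easier to build a new data set with some WHERE conditions, but I'm not sure how to go from there. Any ideas/suggestions?

Edit: Sorry, I simplified my data set and forgot an important exception. Please note my new data set (I only added one row at the end). I do not want to delete group 3, because there match2 is blank. I only want to delete groups where match2 is not blank and match1 does not equal match2.

Thanks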

Answer 1

This is easy to do with a SQL query using GROUP BY and HAVING clauses.

proc sql;
create table want as
  select * 
  from have
  group by id
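  having max( (match1 ne match2) and not missing(match2)) = 0
;
quit;

SAS evaluates boolean expressions to 1/0 for TRUE/FALSE, so the MAX() of a series of TRUE/FALSE values is 1 (TRUE) if any of them is TRUE. Requiring that maximum to equal 0 keeps only the ID groups in which no row has a non-missing match2 that differs from match1.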
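To see the value that the HAVING clause is testing, it can help to print the per-ID flag on its own. The following is a minimal sketch, assuming the HAVE data set from the question; the column name any_violation is made up for illustration.

```sas
proc sql;
  /* One row per ID: 1 if any row has a non-missing match2 that differs from match1, else 0. */
  /* any_violation is a hypothetical column name used only for this illustration.            */
  select id,
         max( (match1 ne match2) and not missing(match2) ) as any_violation
  from have
  group by id
  ;
quit;
```

For the sample data in the question, one would expect the flag to be 1 only for ID 2, since the last row of ID 3 has a missing match2 and is therefore not counted as a violation.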

Answer 2

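There are a few ways to do this. One is to construct a data set of just the IDs that have non-matching rows, then do a merge or SQL join and delete anything that matches that list.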
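A rough sketch of that list-based approach, using PROC SQL with a subquery rather than an explicit join, might look like the following. The table name bad_ids is an assumption made for this example, and the WHERE condition includes the missing-match2 exception from the question's edit.

```sas
proc sql;
  /* Step 1: collect the IDs that have at least one violating row. */
  /* bad_ids is a hypothetical table name for this illustration.   */
  create table bad_ids as
    select distinct id
    from have
    where match1 ne match2 and not missing(match2)
  ;

  /* Step 2: keep only the rows whose ID is not on that list. */
  create table want as
    select *
    from have
    where id not in (select id from bad_ids)
  ;
quit;
```

This reads HAVE twice and materializes an intermediate table, which is the overhead that a single-step approach tries to avoid.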

However, my preferred option (partly for speed, and because it is more straightforward once you understand how it works) is the DoW loop.

data want;
  id_nonmatch = 0;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    if match1 ne match2 then id_nonmatch = 1;   *set the flag to 1 if we find a nonmatch;
  end;
  
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;    
    if id_nonmatch = 0 then output;
  end;
run;

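There are two SET statements in the data step, and each one reads through the same data set independently. If that doesn't make sense, add a put _all_; inside each DO loop; that will show you what it is doing. The first loop goes through all rows for one ID and checks whether any of them violates the constraint. If none does, the flag variable (id_nonmatch) stays 0; if one does, it becomes 1 (and stays that way). Then, when it reaches the ID boundary, it stops pulling records from the first SET statement and moves on to the second, re-pulling those same rows. This time it outputs them only if the flag is zero.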
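The data step above flags every mismatch, including rows where match2 is missing. A possible variant that honors the exception added in the question's edit is sketched below. It assumes HAVE is sorted (or at least grouped) by id, as the BY statement requires; dropping the flag from the output is an extra step added here for tidiness.

```sas
data want;
  id_nonmatch = 0;   /* reset the flag at the start of each ID group */

  /* First pass over the group: look for a violating row. */
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    /* A row only counts as a violation when match2 is present and differs. */
    if match1 ne match2 and not missing(match2) then id_nonmatch = 1;
  end;

  /* Second pass over the same group: output its rows only if no violation was found. */
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    if id_nonmatch = 0 then output;
  end;

  drop id_nonmatch;  /* the flag is a working variable, not part of the result */
run;
```

With the sample data from the question, this should keep the rows for IDs 1 and 3 and drop both rows for ID 2.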

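The double DoW loop is quite efficient because of buffering: unless your ID groups are very large, the data step can use the buffer to keep the same rows in memory rather than re-reading them from disk. (This will depend on your disk and buffers, and it seems to help much less with flash storage than with spinning disks [since there is no disk head whose movement is being saved], so your mileage may vary.)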

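To show the difference, here is a log demonstrating that the second read does not take much extra time when the record size is reasonable. When the records are tiny, the benefit shrinks; I imagine more overhead is involved. Note that the second read adds only about 1/7 of the first read's time to the total processing time (2.74 seconds versus 2.37 seconds of real time)!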

 69         data have;
 70           call streaminit(7);
 71           length strvar $100;
 72           do id = 1 to 100000;
 73             do iter = 1 to 50;
 74               x = rand('Uniform');
 75               output;
 76             end;
 77           end;
 78         run;
 
 NOTE: Variable strvar is uninitialized.
 NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
 NOTE: DATA statement used (Total process time):
       real time           5.20 seconds
       cpu time            5.20 seconds
       
 
 79         
 80         
 81         data _null_;
 82           do _n_ = 1 by 1 until (last.id);
 83             set have;
 84             by id;
 85           end;
 86         run;
 
 NOTE: There were 5000000 observations read from the data set WORK.HAVE.
 NOTE: DATA statement used (Total process time):
       real time           2.37 seconds
       cpu time            2.37 seconds
       
 
 87         
 88         
 89         data _null_;
 90           do _n_ = 1 by 1 until (last.id);
 91             set have;
 92             by id;
 93           end;
 94           do _n_ = 1 by 1 until (last.id);
 95             set have;
 96             by id;
 97           end;
 98         run;
 
 NOTE: There were 5000000 observations read from the data set WORK.HAVE.
 NOTE: There were 5000000 observations read from the data set WORK.HAVE.
 NOTE: DATA statement used (Total process time):
       real time           2.74 seconds
       cpu time            2.73 seconds
