SAS - PROC SQL - Sum values into unique columns

I am simplifying my table description to keep the question concise...

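I have a dataset with 3 columns. The first column contains 100 cost categories (i.e. a unique key), the second contains the Cost for the given cost category, and the third contains the UnitsSold.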

My goal is to turn this into a table that has one column per CostCat, holding the sum of the Cost field for that category, grouped by UnitsSold.

╔════════════╦══════════╦══════════╦═══════
║  UnitsSold ║ CostCat1 ║ CostCat2 ║ CostCat...
╠════════════╬══════════╬══════════╬═══════
║    1       ║    50    ║    10    ║ ...
║    2       ║    20    ║    15    ║ ...
║    ...     ║    ...   ║    ...   ║ ...
╚════════════╩══════════╩══════════╩═══════

I am inclined to use code like this:

PROC SQL;
CREATE TABLE cartesian AS
SELECT
  UnitsSold,
  SUM(CASE WHEN CostCat=1 THEN Cost ELSE 0 END) AS CostCat1,
  SUM(CASE WHEN CostCat=2 THEN Cost ELSE 0 END) AS CostCat2,
  SUM(CASE WHEN CostCat=3 THEN Cost ELSE 0 END) AS CostCat3,
  ...
  SUM(CASE WHEN CostCat=100 THEN Cost ELSE 0 END) AS CostCat100
FROM have  /* source table name assumed */
GROUP BY UnitsSold;
QUIT;

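Is there a more efficient way than writing a ridiculous number of CASE statements? (Obviously I would use Excel to generate the actual typing.)

I imagine some kind of macro loop might do it, but I am not familiar enough with macros to know how to write one.

I have traditionally used PROC SQL, so that is my preference, but I am also open to SAS code solutions.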

As Reeza pointed out, the best approach is probably a combination of proc sql or proc means/summary, followed by proc transpose. I assume you know SQL, so I will start with that version.

proc sql;
create table tmp as
select UnitsSold, CostCat, sum(cost) as cost
from have
group by UnitsSold, CostCat;
quit;

If you want to do this with a SAS procedure instead, you can use proc summary:

proc summary data=have nway missing;
class UnitsSold CostCat;
var Cost;
output out=tmp(drop=_:) sum=;  ** drop=_: removes the automatic variables created in the procedure;
run;

Now that the table has been summarized and sorted by UnitsSold and CostCat, you can transpose it.

proc transpose data=tmp out=want(drop=_NAME_) prefix=CostCat;
by UnitsSold;
id CostCat;
var cost;
run;

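Michael:

The question describes a PIVOT operation, known as TRANSPOSE in SAS terms, and as Paste/Special Transpose or a pivot table in Excel.

If you stick with Proc SQL statements, there is no PIVOT operator. SQL Server and some other databases do have a PIVOT operator. But supposing you stick with SAS Proc SQL, you are correct: you will need that many CASE statements to create the across variables.

There are many ways to pivot in SAS. Here are six of them:

Sample data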

data have;
  do row = 1 to 500;
    cost_cat = ceil(100 * ranuni(123));
    cost = 10 + floor(50 * ranuni(123));
    units_sold = floor (20 * ranuni(123));
    output;
  end;
run;

Way 1 - Proc TABULATE: pivot for presentation only

The class variables are used in the table statement to lay out the rows and columns.

proc tabulate data=have;
  class cost_cat units_sold;
  var cost;
  table units_sold, cost_cat*cost*sum / nocellmerge;
run;

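Way 2 - Proc REPORT: pivot for presentation only

The cost_cat and cost columns are stacked. There is no define statement for cost, so it defaults to analysis usage with the sum statistic. The sum is computed over the cost values in each group * across cell: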

proc report data=have;
  columns units_sold (cost_cat, cost) ;
  define units_sold / group;
  define cost_cat / across;
run;

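Way 3 - Proc MEANS + Proc TRANSPOSE: pivot for data

Transpose can create a dataset whose columns are 'out of order', because the columns are created in the order in which the id values are encountered while stepping through the units_sold groups. This can be prevented by adding extra data to have: rows with units_sold = -1 and one row for each cost_cat value. The extra group is then removed as part of the TRANSPOSE out= dataset options, for example: (... where=(units_sold ne -1))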

proc means noprint data=have;
  class units_sold cost_cat;
  var cost;
  ways 2;
  output sum=sum out=haveMeans ;
run;

proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
run;
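
The column-order workaround described above can be sketched on a tiny, made-up dataset. This is only an illustration under stated assumptions: demoMeans, padRows, demoPadded and demoAcross are hypothetical names, and the padding rows are added to the already summarized data rather than to have itself.

```sas
/* Hypothetical summarized data: the first group only has cost_cat 3,  */
/* so an unpadded transpose would create cost_sum_3 before cost_sum_1. */
data demoMeans;
  input units_sold cost_cat sum;
  datalines;
1 3 40
2 1 50
2 2 10
;
run;

/* One padding row per distinct id value, sorted by the id. */
proc sort data=demoMeans(keep=cost_cat) out=padRows nodupkey;
  by cost_cat;
run;

/* Put the padding group first; -1 sorts before every real units_sold. */
data demoPadded;
  set padRows(in=isPad) demoMeans;
  if isPad then units_sold = -1;   /* sum stays missing on padding rows */
run;

/* The padding group fixes the column order and is dropped on output. */
proc transpose data=demoPadded
               out=demoAcross(drop=_name_ where=(units_sold ne -1))
               prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
run;
```

With the padding group processed first, the columns should be created in the order cost_sum_1, cost_sum_2, cost_sum_3, and only the two real units_sold groups should remain in demoAcross.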

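Way 4 - SQL macro-generated `wallpaper` code: specific to one dataset

The macro is simpler because it is specific to the dataset in question. For a more general case, the salient aspects of the statement generation can be abstracted and further macroized (see Way 5).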

%macro pivot_across;
  %local i;

  proc sql;
    create table wantAcross2 as
    select units_sold
    %do i = 1 %to 100;  %* codegen many sql select expressions;
    , sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
    %end;
    from have
    group by units_sold;
  quit;
%mend;

%pivot_across;

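Tip: with some changes, the code generation could target Proc SQL pass-through and perform the pivot remotely.

Way 5 - SQL macro-generated `wallpaper` code: any dataset

Well, not any dataset. In its current form this macro handles id variables that are numeric and whose values can be expressed exactly as the numeric literal emitted by cats(). A more robust version would check the type of the id variable and quote the id values that are compared in the generated CASE statements. The most robust version would generate CASE statements that compare the put(..., RB8.) of each id value.
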
%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);

/*
 * The CASE statement code generator will need tweaking to handle character id variables (i.e. QUOTE of the &id)
 * The CASE statement code generator will need tweaking to handle numeric id variables that have non-integer values
 * inexpressible as a simple source code numeric literal (i.e. may need to compare data when formatted as RB8.).
 */

  %local case_statements;

  proc sql noprint;
    select
    "&aggregate_function ("
    || "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"   
    || ") as &prefix" || cats(idValues.&id) || "&suffix"
    into :case_statements
    separated by ','
    from (select distinct &id from &data) as idValues
    order by &id
    ;

  %*put NOTE: %superq(case_statements);

    create table &out as
    select &by, &case_statements
    from &data
    group by &by;
  quit;

%mend;

%sql_transpose 
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);

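Tip: with some changes, the code generation could target Proc SQL pass-through and perform the pivot remotely. Special care would be needed when gathering the data behind case_statements.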
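
The character-id tweak mentioned for Way 5 could look roughly like the following. This is a minimal sketch and not the macro itself: demoChar, region and demoCharAcross are made-up names, and it assumes every id value is also usable as the tail of a SAS variable name.

```sas
/* Hypothetical data with a character id variable (region). */
data demoChar;
  length region $8;
  input units_sold region $ cost;
  datalines;
1 east 50
1 west 10
2 east 20
2 east 5
;
run;

proc sql noprint;
  /* quote() wraps each id value in double quotes so the generated */
  /* comparison uses a valid character literal.                    */
  select
    'sum(case when region = ' || quote(trim(region))
    || ' then cost else 0 end) as cost_sum_' || trim(region)
  into :case_statements separated by ','
  from (select distinct region from demoChar) as idValues
  order by region;

  /* The generated expressions are spliced into the pivot query. */
  create table demoCharAcross as
  select units_sold, &case_statements
  from demoChar
  group by units_sold;
quit;
```

Uncommenting a %put of &case_statements, as in the macro above, is a quick way to inspect the generated expressions before trusting the result.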

Way 6 - Hash table: numerically indexed pivot columns

If you are a hash enthusiast, this code might not seem over the top.

data _null_;
  if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;

  * hash for tracking id values;

  declare hash ids(ordered:'a');
  ids.defineKey('cost_cat');
  ids.defineDone();

  * hash for tracking sums
  * NOTE: The data node has a sum variable instead of using 
  * argument tags suminc: and keysum:  This was done because HITER NEXT() does not 
  * automatically load the keysum value into its PDV variable (meaning
  * another lookup via .SUM() would have to occur in order to obtain it);

  call missing (cost_sum);

  declare hash sums(ordered:'a');
  sums.defineKey('units_sold', 'cost_cat');
  sums.defineData('units_sold', 'cost_cat', 'cost_sum');
  sums.defineDone();

  * scan the data - track the id values and sums for pivoted output;

  do while (not done);
    set have(keep=units_sold cost_cat cost) end=done;

    ids.ref();

    if (0 ne sums.find()) then cost_sum = 0;
    cost_sum + cost;
    sums.replace();
  end;

  * create a dynamic output target;
  * a pool of pdv host variables is required for target;

  array cells cost_sum_1 - cost_sum_10000;
  call missing (of cost_sum_1 - cost_sum_10000);

  * need to iterate across the id values in order to create a 
  * variable name that will be part of the wanted data node;

  declare hiter across('ids');

  declare hash want(ordered:'a');
  want.defineKey('units_sold');
  want.defineData('units_sold');
  do while (across.next() = 0);
    want.defineData(cats('cost_sum_',cost_cat));  * sneaky! ;
  end;
  want.defineDone();

  * populate target;
  * iterate through the sums filling in the PDV variables
  * associated with the dynamically defined data node;

  declare hiter item('sums');
  prior_key1 = .; prior_key2 = .;
  do while (item.next() = 0);
    if units_sold ne prior_key1 then do;
      * when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
      if prior_key1 ne . then do;
        key1_hold = units_sold;
        units_sold = prior_key1;
        want.add();                  * save 'row' to hash;
        units_sold = key1_hold;
        call missing (of cells(*));
      end;
    end;

    cells[cost_cat] = cost_sum;
    prior_key1 = units_sold;
  end;
  want.add();

  * output target;

  want.output (dataset:'wantAcross4');

  stop;
run;

Verification

Proc COMPARE will show that all the want outputs are the same.

proc compare nomissing 
  noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
  id units_sold;
run;

proc compare 
  noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
  id units_sold;
run;

proc compare nomissing 
  noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
  id units_sold;
run;
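
Because the comparisons run with noprint, one way to read the result is to count the rows in each out= dataset: with outnoequal, a dataset with no rows suggests no differing values were found. The helper below is a hypothetical sketch (report_diff_rows is not part of the original answer) that uses the open, attrn and close functions through %sysfunc.

```sas
%macro report_diff_rows(data);   /* hypothetical helper name */
  %local dsid nobs rc;

  %* open the dataset and stop early if it does not exist;
  %let dsid = %sysfunc(open(&data));
  %if &dsid = 0 %then %do;
    %put WARNING: &data could not be opened.;
    %return;
  %end;

  %* NLOBS is the number of logical (non-deleted) observations;
  %let nobs = %sysfunc(attrn(&dsid, nlobs));
  %let rc = %sysfunc(close(&dsid));

  %put NOTE: &data has &nobs row(s) flagged as different.;
%mend;

%report_diff_rows(diff1_2)
%report_diff_rows(diff2_3)
%report_diff_rows(diff3_4)
```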