SAS - PROC SQL - Sum values into unique columns
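Simplifying the description of my table to keep my question concise...
I have a dataset with 3 columns. The first column contains 100 cost categories (i.e. a unique key), the second contains the cost for a given cost category, and the third contains the units sold.
My goal is to turn this into a table that has one column for each CostCat, containing the sum of the Cost field for that given category, grouped by UnitsSold.
i.e.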
╔════════════╦══════════╦══════════╦═══════
║ UnitsSold ║ CostCat1 ║ CostCat2 ║ CostCat...
╠════════════╬══════════╬══════════╬═══════
║ 1 ║ 50 ║ 10 ║ ...
║ 2 ║ 20 ║ 15 ║ ...
║ ... ║ ... ║ ... ║ ...
╚════════════╩══════════╩══════════╩═══════
I am inclined to use code like this:
PROC SQL;
CREATE TABLE cartesian AS
SELECT
UnitsSold,
SUM(CASE WHEN CostCat=1 THEN Cost ELSE 0 END) AS CostCat1,
SUM(CASE WHEN CostCat=2 THEN Cost ELSE 0 END) AS CostCat2,
SUM(CASE WHEN CostCat=3 THEN Cost ELSE 0 END) AS CostCat3,
...
SUM(CASE WHEN CostCat=100 THEN Cost ELSE 0 END) AS CostCat100
FROM have /* source data set name assumed; it is called have in the answers */
GROUP BY UnitsSold;
QUIT;
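I am wondering whether there is a more efficient way than writing a ridiculous number of CASE statements? (Obviously I could use Excel to generate the actual typing.)
I imagine some kind of macro loop could do it, but I am not familiar enough with macros to know how.
I traditionally use PROC SQL, so that is my preference, but I am also open to SAS code solutions.

As Reeza pointed out, the best approach is probably a combination of proc sql or proc means/summary, followed by proc transpose. I assume you know SQL, so I will start with that.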
proc sql;
create table tmp as
select UnitsSold, CostCat, sum(cost) as cost
from have
group by UnitsSold, CostCat;
quit;
If you would rather do this with a SAS procedure, you can use proc summary.
proc summary data=have nway missing;
class UnitsSold CostCat;
var Cost;
output out=tmp(drop=_:) sum=; ** drop=_: removes the automatic variables created in the procedure;
run;
Now that the table is summarized and sorted by UnitsSold and CostCat, you can transpose it.
proc transpose data=tmp out=want(drop=_NAME_) prefix=CostCat;
by UnitsSold;
id CostCat;
var cost;
run;
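One difference from the CASE approach in the question is worth noting: PROC TRANSPOSE leaves a missing value wherever a UnitsSold group has no rows for a category, while the SUM(CASE ... ELSE 0 END) form returns 0. If zeros are wanted, a follow-up data step along these lines should work. This is only a sketch. It assumes the transposed data set is named want and that every pivoted column starts with the CostCat prefix, as in the step above. The names want0, pivoted and _i are made up for illustration.

```sas
data want0;
  set want;
  * collect every column created with prefix=CostCat;
  array pivoted {*} CostCat:;
  do _i = 1 to dim(pivoted);
    * a missing cell means the group had no rows for that category;
    if missing(pivoted[_i]) then pivoted[_i] = 0;
  end;
  drop _i;
run;
```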
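
Michael:
The question describes a PIVOT operation, known in SAS terms as TRANSPOSE and in Excel as Paste/Special transpose or a PIVOT table.
If you stick with Proc SQL, there is no PIVOT operator. SQL Server and some other databases do have a PIVOT operator. But assuming you stay with SAS Proc SQL, you are correct: you will need that many CASE statements to create the across variables.
There are many ways to pivot in SAS. Here are six ways:
Sample data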
data have;
do row = 1 to 500;
cost_cat = ceil(100 * ranuni(123));
cost = 10 + floor(50 * ranuni(123));
units_sold = floor (20 * ranuni(123));
output;
end;
run;
Way 1 - Proc TABULATE: pivot for presentation only
Class variables are used in the table statement to lay out the rows and columns.
proc tabulate data=have;
class cost_cat units_sold;
var cost;
table units_sold, cost_cat*cost*sum / nocellmerge;
run;
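Way 2 - Proc REPORT: pivot for presentation only
The cost category and cost columns are stacked. Cost has no define statement, so as a numeric variable it defaults to an analysis variable with the sum statistic. The sum is computed over the cost values in each group * across cell: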
proc report data=have;
columns units_sold (cost_cat, cost) ;
define units_sold / group;
define cost_cat / across;
run;
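Way 3 - Proc MEANS + Proc TRANSPOSE: pivot to data
Transpose will create a data set with its columns 'out of order', because the columns are created in the order the id values appear as you step through the units_sold groups.
This can be prevented by adding extra data to have. The extra data would have units_sold = -1 and one row for each cost_cat value. The extra group is then removed as part of the TRANSPOSE out= data set options, for example: (... where=(units_sold ne -1))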
proc means noprint data=have;
class units_sold cost_cat;
var cost;
ways 2;
output sum=sum out=haveMeans ;
run;
proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
by units_sold;
var sum;
id cost_cat;
run;
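The column-ordering fix described above could be sketched as follows. This is one possible reading of that description, not tested against the other outputs. The data set names seed, haveSeeded, haveMeansSeeded and wantOrdered are made up for illustration, and the sketch assumes no real units_sold value is -1 or lower, so the dummy group sorts first.

```sas
* one seed row per cost_cat value, under a dummy group;
proc sql;
  create table seed as
  select distinct -1 as units_sold, cost_cat, 0 as cost
  from have;
quit;

* put the dummy group together with the real data;
data haveSeeded;
  set seed have;
run;

* class output is ordered by units_sold then cost_cat,
  so the -1 group lists every cost_cat in ascending order first;
proc means noprint data=haveSeeded;
  class units_sold cost_cat;
  var cost;
  ways 2;
  output sum=sum out=haveMeansSeeded;
run;

* columns are created in first-seen order and the dummy group is dropped on output;
proc transpose data=haveMeansSeeded
               out=wantOrdered(drop=_name_ where=(units_sold ne -1))
               prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
run;
```

Because the first BY group seen by TRANSPOSE contains every id value in order, the cost_sum_ columns should come out in ascending cost_cat order.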
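Way 4 - SQL `wallpaper` code generated by a macro: specific to one data set
The macro is simpler because it is specific to the data set in question. For the more general case, the salient aspects of the statement generation can be abstracted and macroized further (see Way 5).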
%macro pivot_across;
%local i;
proc sql;
create table wantAcross2 as
select units_sold
%do i = 1 %to 100; %* codegen many sql select expressions;
, sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
%end;
from have
group by units_sold;
quit;
%mend;
%pivot_across;
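Tip: with some changes, the generated code could be sent through Proc SQL pass-through so that the pivot executes remotely.
Way 5 - SQL `wallpaper` code generated by a macro: any data set
Well, not any data set. In its current form this macro handles id variables that are numeric and whose values can be expressed exactly as the numeric literal that cats() emits. A more robust version would check the type of the id variable and quote the id values compared in the generated CASE statements. The most robust version would generate CASE statements that compare the put(..., RB8.) of each id value.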
%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);
/*
* CASE statement code generator will need tweaking to handle character id variables (i.e. QUOTE of the &id)
* CASE statement code generator will need tweaking to handle numeric id variables that have non-integer values
* inexpressible as a simple source code numeric literal (i.e. may need to compare data when formatted as RB8.)
*/
%local case_statements;
proc sql noprint;
select
"&aggregate_function ("
|| "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"
|| ") as &prefix" || cats(idValues.&id) || "&suffix"
into :case_statements
separated by ','
from (select distinct &id from &data) as idValues
order by &id
;
%*put NOTE: %superq(case_statements);
create table &out as
select &by, &case_statements
from &data
group by &by;
quit;
%mend;
%sql_transpose
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);
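As a rough illustration of the 'more robust' direction, here is how the generated comparison might be quoted for a character id variable. This is a sketch, not a replacement for the macro above. The names sql_transpose_char, haveChar, cost_cat_c and wantAcross3c are made up for illustration. It handles character ids only, and it assumes each id value contains no macro triggers (& or %) and yields a valid SAS name when appended to the prefix.

```sas
%macro sql_transpose_char (data=, out=, by=, var=, id=, prefix=);
  %local case_statements;
  proc sql noprint;
    /* quote() wraps each id value so it is compared as a character literal */
    select
      "sum(case when &id = " || quote(trim(idValues.&id))
      || " then &var else 0 end) as &prefix" || cats(idValues.&id)
    into :case_statements separated by ','
    from (select distinct &id from &data) as idValues
    order by &id;

    create table &out as
    select &by, &case_statements
    from &data
    group by &by;
  quit;
%mend;

* hypothetical character version of the id variable with values C1 to C100;
data haveChar;
  set have;
  cost_cat_c = cats('C', cost_cat);
run;

%sql_transpose_char
( data=haveChar, out=wantAcross3c, by=units_sold
, id=cost_cat_c, var=cost, prefix=cost_sum_ );
```

A version that serves both numeric and character ids would still need to look up the variable type first and choose between the quoted and unquoted form.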
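Tip: with some changes, the generated code could be sent through Proc SQL pass-through so that the pivot executes remotely. Special care would be needed when collecting the data behind case_statements.
Way 6 - Hash table: numerically indexed pivot columns
If you are a hash fanatic, this code may not look extravagant.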
data _null_;
if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;
* hash for tracking id values;
declare hash ids(ordered:'a');
ids.defineKey('cost_cat');
ids.defineDone();
* hash for tracking sums
* NOTE: The data node has a sum variable instead of using
* argument tags suminc: and keysum: This was done because HITER NEXT() does not
* automatically load the keysum value into its PDV variable (meaning
* another lookup via .SUM() would have to occur in order to obtain it);
call missing (cost_sum);
declare hash sums(ordered:'a');
sums.defineKey('units_sold', 'cost_cat');
sums.defineData('units_sold', 'cost_cat', 'cost_sum');
sums.defineDone();
* scan the data - track the id values and sums for pivoted output;
do while (not done);
set have(keep=units_sold cost_cat cost) end=done;
ids.ref();
if (0 ne sums.find()) then cost_sum = 0;
cost_sum + cost;
sums.replace();
end;
* create a dynamic output target;
* a pool of pdv host variables is required for target;
array cells cost_sum_1 - cost_sum_10000;
call missing (of cost_sum_1 - cost_sum_10000);
* need to iterate across the id values in order to create a
* variable name that will be part of the wanted data node;
declare hiter across('ids');
declare hash want(ordered:'a');
want.defineKey('units_sold');
want.defineData('units_sold');
do while (across.next() = 0);
want.defineData(cats('cost_sum_',cost_cat)); * sneaky! ;
end;
want.defineDone();
* populate target;
* iterate through the sums filling in the PDV variables
* associated with the dynamically defined data node;
declare hiter item('sums');
prior_key1 = .; prior_key2 = .;
do while (item.next() = 0);
if units_sold ne prior_key1 then do;
* when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
if prior_key1 ne . then do;
key1_hold = units_sold;
units_sold = prior_key1;
want.add(); * save 'row' to hash;
units_sold = key1_hold;
call missing (of cells(*));
end;
end;
cells[cost_cat] = cost_sum;
prior_key1 = units_sold;
end;
want.add();
* output target;
want.output (dataset:'wantAcross4');
stop;
run;
Verification
Proc COMPARE will show that all of the want outputs are the same.
proc compare nomissing
noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
id units_sold;
run;
proc compare
noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
id units_sold;
run;
proc compare nomissing
noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
id units_sold;
run;
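If you would rather have a programmatic check than inspect the diff data sets, PROC COMPARE also sets the automatic macro variable SYSINFO to a return code, where 0 means no differences were flagged. A minimal sketch, assuming the wantAcross data sets above exist. The name rc_1_2 is made up here. SYSINFO has to be read right after the step, because later steps can reset it.

```sas
proc compare nomissing noprint data=wantAcross1 compare=wantAcross2;
  id units_sold;
run;
%* capture the return code immediately after the step;
%let rc_1_2 = &sysinfo;
%put NOTE: compare 1 vs 2 return code is &rc_1_2 (0 means no differences flagged);
```

A nonzero code does not necessarily mean the values differ. The code is a sum of flags that also covers attribute differences such as labels or variables present in only one data set, so it is worth decoding before drawing conclusions.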