How to average a subset of data in SAS conditional on a date?
I am trying to write SAS code that can loop over a dataset containing event dates, which looks like this:
Data event;
input Date;
cards;
20200428
20200429
;
run;
and compute the average over the three prior days from another dataset that contains dates and volumes, which looks like this:
Data vol;
input Date Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
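For example, for date 20200428 the average should be 88.33 [(95+80+90)/3], and for date 20200429 it should be 87.00 [(86+95+80)/3]. If possible, I would like to save these values, together with the volume on the event date, in a new dataset that looks like this: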
Data clean;
input Date Vol Avg;
cards;
20200428 86 88.33
20200429 110 87.00
;
run;
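The actual data I am working with covers 1970-2010. I may also extend the averaging period from 3 prior days to 10 prior days, so I want flexible code. From what I have read, I think a macro and/or call symput might work well for this, but I am not sure how to code them to do what I want. Honestly, I don't know where to start. Can anyone point me in the right direction? I am open to any advice/ideas. Thanks.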
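You want to loop over a series of dates in an input dataset. I therefore use a PROC SQL statement that selects the distinct dates from this input dataset into a macro variable. That macro variable then drives the loop. In your example it would hold: 20200428 20200429. You can then use the %SCAN macro function to step through these dates.
For each date in the loop we calculate the average: in your example, the average over the 3 days before the looped date. Because the number of days to average over is variable, it is also passed to the macro as a parameter. I then use the INTNX function to calculate the lower bound of the dates to select for the average. PROC MEANS then calculates the mean volume over those days: from the lower bound up to, but not including, the looped date.
A small data step in between attaches the looped date to the calculated average again. Finally, everything is appended to a final dataset.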
%macro dayAverage(input = , range = , selectiondata = );
/* input         = input dataset with the event dates
   range         = number of days prior to the selected date for which you want to
                   calculate the average
   selectiondata = dataset that holds the volumes */
%local datesrange I currentdate mindate;
/* Create a macro variable with the dates for which you want to calculate the
   average, to loop over */
proc sql noprint;
select distinct date into :datesrange separated by " "
from &input.;
quit;
/* Start looping over the dates for which you want to calculate the average */
%let I = 1;
%do %while (%scan(&datesrange., &I.) ne %str());
/* Assign the current date in the loop to the variable currentdate */
%let currentdate = %scan(&datesrange., &I.);
/* Create the minimum date in the range based on input parameter range.
   The dates are plain yyyymmdd numbers, so convert to a SAS date and back. */
%let mindate = %sysfunc(putn(%sysfunc(intnx(day, %sysfunc(inputn(&currentdate., yymmdd8.)), -&range.)), yymmddn8.));
/* Calculate the mean volume for the selected date and selected range */
proc means data = &selectiondata.(where = (date >= &mindate. and date < &currentdate.)) noprint;
output out = averagecurrent(drop = _type_ _freq_) mean(volume) = average_volume;
run;
/* Add the current date to the calculated average */
data averagecurrent;
retain date average_volume; /* puts date first in the variable order */
set averagecurrent;
date = &currentdate.;
run;
/* Append the result to a final list */
proc datasets nolist;
append base = final data = averagecurrent force;
run;
quit;
%let I = %eval(&I. + 1);
%end;
%mend;
For your example, this macro can be called as:
%dayAverage(input = event, range = 3, selectiondata = vol);
It will give you a dataset called final in your work library.
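The question also asks for the event-day volume next to the average. As a minimal sketch, assuming final was produced by the macro call above and that both final and vol hold dates as plain yyyymmdd numbers as in the question, the two can be combined with a sort and a merge. The names vol_sorted and clean_macro are made up for this illustration. Because the macro appends to final, the commented-out PROC DATASETS step shows one way to clear it before a rerun.

```sas
/* Optional: clear results of an earlier run, since the macro appends to FINAL */
/* proc datasets library=work nolist; delete final; quit; */

/* MERGE with BY needs both inputs sorted by the BY variable */
proc sort data=final;
  by date;
run;

proc sort data=vol out=vol_sorted;   /* vol_sorted is a hypothetical name */
  by date;
run;

/* Keep only the event dates and bring in that day's volume as VOL */
data clean_macro;                    /* clean_macro is a hypothetical name */
  merge final(in=in_final)
        vol_sorted(rename=(volume=vol));
  by date;
  if in_final;
run;
```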
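A SQL statement is by far the most succinct code for getting the result set.
The query joins 2 independent references to the volume data. The first gets the volume on the event date, and the second computes the average volume over the three prior days.
The date data should be read in as SAS dates so that the BETWEEN condition is correct.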
Data event;
input Date: yymmdd8.;
cards;
20200428
20200429
;
run;
Data vol;
input Date: yymmdd8. Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
* SQL query using GROUP BY;
proc sql;
create table want as
select
event.date
, volume_one.volume
, mean(volume_two.volume) as avg
from event
left join vol as volume_one
on event.date = volume_one.date
left join vol as volume_two
on volume_two.date between event.date-3 and event.date-1
group by
event.date, volume_one.volume
;
/* Alternative query using a correlated subquery */
create table want_2 as
select
event.date
, volume
, ( select mean(volume) as avg from vol where vol.date between event.date-3 and event.date-1 )
as avg
from event
left join vol
on event.date = vol.date
;
quit;
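Since the question asks for a window that can grow from 3 to 10 days, the GROUP BY query can be parameterized with a macro variable. This is only a sketch, assuming event and vol were read with the yymmdd8. informat as above; the macro variable n_days and the table name want_n are hypothetical names chosen for the illustration.

```sas
%let n_days = 3;   /* hypothetical macro variable: look-back window in calendar days */

proc sql;
  create table want_n as             /* want_n is a hypothetical table name */
  select e.date format=yymmddn8.
       , v1.volume
       , mean(v2.volume) as avg format=8.2
  from event as e
  left join vol as v1
    on v1.date = e.date              /* volume on the event date itself */
  left join vol as v2
    on v2.date between e.date - &n_days. and e.date - 1   /* the n_days prior days */
  group by e.date, v1.volume
  ;
quit;
```

With n_days set to 3, the sample data should reproduce the averages from the question, 88.33 for 20200428 and 87.00 for 20200429. Changing the %let line to 10 widens the window without touching the query.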
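For cases where the volumes data has date gaps, a better solution is to compute a rolling average of the N prior volumes separately. Date gaps can come from weekends, holidays, or dates missing because of data-entry problems or operator error. Conceptually, for the averaging, the only role of date is to order the data.
Once the rolling average has been computed, a simple join or merge can be done.
Example: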
* Simulate some volume data that excludes weekends, holidays, and a 2% rate of missing dates;
data volumes(keep=date volume);
call streaminit(20200502);
do date = '01jan1970'd to today();
length holiday $16; /* assumed length; it must fit the longest holiday name used below */
year = year(date);
holiday = 'NEWYEAR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USINDEPENDENCE'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'THANKSGIVING'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'CHRISTMAS'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'MEMORIAL'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'LABOR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'EASTER'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USPRESIDENTS'; hdate = holiday(holiday, year); if date=hdate then continue;
if weekday(date) in (1,7) then continue; *1=Sun, 7=Sat;
volume = 100 + ceil(75 * sin (date / 8));
if rand('uniform') < 0.02 then continue;
output;
end;
format date yymmdd10.;
run;
* Compute an N item rolling average from N prior values;
%let ROLLING_N = 5;
data volume_averages;
set volumes;
by date; * enforce sort order requirement;
array v[0:%eval(&ROLLING_N. - 1)] _temporary_; /* circular buffer with one slot per prior value */
retain index -1;
avg_prior_&ROLLING_N. = mean(of v[*]); /* average of the stored prior values, taken before the current row is added */
OUTPUT;
index = mod(index + 1, &ROLLING_N.); /* modular arithmetic, the foundation of rolling */
v[index] = volume;
drop index;
run;
* merge (event must be sorted by date);
data want_merge;
merge event(in=event_date) volume_averages;
by date;
if event_date;
run;
* join;
proc sql;
create table want_join as
select event.*, volume_averages.avg_prior_&ROLLING_N.
from event join volume_averages
on event.date = volume_averages.date;
quit;
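As a quick check of the rolling approach, the same circular-buffer step can be run on the small vol sample from the question. This is a sketch under a few assumptions: vol is sorted first because the sample is listed in descending date order, and N_PRIOR, vol_sorted, vol_averages, and avg_prior are hypothetical names used only here.

```sas
%let N_PRIOR = 3;   /* hypothetical macro variable: number of prior rows to average */

/* The sample lists dates in descending order; BY date needs ascending order */
proc sort data=vol out=vol_sorted;
  by date;
run;

data vol_averages;
  set vol_sorted;
  by date;
  array v[0:%eval(&N_PRIOR. - 1)] _temporary_;  /* circular buffer of prior volumes */
  retain index -1;
  avg_prior = mean(of v[*]);      /* mean of the prior rows stored so far */
  output;
  index = mod(index + 1, &N_PRIOR.);
  v[index] = volume;              /* store the current row for later rows */
  drop index;
run;
```

On the sample, the rows for 20200428 and 20200429 should come out as 88.33 and 87.00, matching the question. The first N_PRIOR rows average over fewer values (the very first row has none, so avg_prior is missing there), which matters only at the start of the series.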