How to average a subset of data in SAS conditional on a date?

I'm trying to write SAS code that loops through a dataset containing event dates, like this:

Data event;
     input Date;
     cards;
     20200428
     20200429
     ;
run;

and calculates the average of the three prior days from another dataset containing dates and volumes, like this:

Data vol;
     input Date Volume;
     cards;
     20200430  100
     20200429  110
     20200428  86
     20200427  95
     20200426  80
     20200425  90
     ;
run;

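For example, for date 20200428 the average should be 88.33 [(95+80+90)/3], and for date 20200429 the average should be 87.00 [(86+95+80)/3]. If possible, I'd like to save these values, together with the volume on the event date, in a new dataset like the one below.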

Data clean;
     input Date Vol Avg;
     cards;
     20200428 86 88.33
     20200429 110 87.00
     ;
run;

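The actual data I'm working with runs from 1970 to 2010. I may also increase my averaging period from 3 days prior to 10 days prior, so I'd like the code to be flexible. From what I've read, I think a macro and/or CALL SYMPUT might work well for this, but I'm not sure how to code them to do what I want. Honestly, I don't know where to start. Can anyone point me in the right direction? I'm open to any advice/ideas. Thanks.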

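You want to loop over a series of dates in an input dataset. So I use a PROC SQL statement that selects the distinct dates from that input dataset into a macro variable. That macro variable is then used for looping. In your example, the macro variable would be: 20200428 20200429. You can then use the %SCAN macro function to loop over these dates.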

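For each date in the loop, we calculate an average: in your example, the average of the 3 days before the looped date. Because the number of days to average over is variable, it is also passed to the macro as a parameter. I then use the INTNX function to calculate the lower bound of the dates to select for the average. PROC MEANS is then used to calculate the average volume over those days: from the lower bound up to, but not including, the looped date.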

I then put a small data step in between that attaches the looped date to the calculated average. Finally, everything is appended to a final dataset.

%macro dayAverage(input = , range = , selectiondata = );

  /* Input = input dataset
     range = number of days prior to the selected date for which you want to calculate 
             the average
     selectiondata = data where the volumes are in */

  /* Create a macro variable with the dates for which you want to calculate the 
     average, to loop over */

    proc sql noprint;
      select distinct date into: datesrange separated by " "
      from &input.;
    quit;

  /*Start looping over the dates for which you want to calculate the average */

    %let I = 1;
    %do %while (%scan(&datesrange.,&I.) ne %str());

        /* Assign the current date in the loop to the variable currentdate */

        %let currentdate =  %scan(&datesrange.,&I.);

        /* Create the minimum date in the range based on input parameter range */

      %let mindate = %sysfunc(putn(%sysfunc(intnx(day,%sysfunc(inputn(&currentdate.,yymmdd8.)),-&range.)),yymmddn8.));

      /* Calculate the mean volume for the selected date and selected range */

      proc means data = &selectiondata.(where = (date >= &mindate. and date < &currentdate.)) noprint;
        output out = averagecurrent(drop = _type_ _freq_) mean(volume) = average_volume;
      run;

      /* Add the current date to the calculated average */

      data averagecurrent;
        retain date average_volume;
        set averagecurrent;

        date = &currentdate.;
      run;

     /* Append the result to a final list */
        proc datasets nolist;
        append base = final data = averagecurrent force;
        run;
        quit;

        %let I = %eval(&I. + 1);

  %end;
 %mend;

For your example, this macro can be called as:

%dayAverage(input = event, range = 3, selectiondata = vol);

It will give you a dataset named final in your work library.
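Because the macro appends to final on every call, re-running it keeps adding rows to the same dataset. The following is a minimal sketch, assuming the dayAverage macro and the event and vol datasets defined above, of clearing earlier results before trying a different window; the 10-day range is only an illustration taken from the question.

```sas
/* Remove earlier results so repeated runs do not accumulate rows.
   NOWARN keeps PROC DATASETS quiet if FINAL does not exist yet. */
proc datasets library = work nolist nowarn;
  delete final;
quit;

/* Re-run with a wider window (illustrative value) */
%dayAverage(input = event, range = 10, selectiondata = vol);

/* Inspect the result */
proc print data = final noobs;
run;
```

With the six-row sample vol data, a 10-day window simply averages whichever prior days are present.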

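An SQL statement is by far the most succinct code for getting the result set. The query joins two independent references to the volume data: the first gets the volume on the event date, and the second computes the average volume over the three prior days.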

The date data should be read in as SAS dates so that the BETWEEN condition is correct.

Data event;
     input Date: yymmdd8.;
     cards;
     20200428
     20200429
     ;
run;

Data vol;
     input Date: yymmdd8. Volume;
     cards;
     20200430  100
     20200429  110
     20200428  86
     20200427  95
     20200426  80
     20200425  90
     ;
run;
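A minimal sketch of why the date arithmetic needs SAS date values rather than raw numbers such as 20200501. The variable names here are illustrative and not part of the original answer.

```sas
data _null_;
  raw  = 20200501;                      /* plain number that only looks like a date */
  sasd = input('20200501', yymmdd8.);   /* SAS date value (days since 01JAN1960)    */

  raw_prev  = raw  - 1;                 /* 20200500, which is not a calendar date   */
  sasd_prev = sasd - 1;                 /* the previous day, 30 April 2020          */

  /* Write both results to the log; the format applies to sasd_prev only */
  put raw_prev= sasd_prev= yymmddn8.;
run;
```

Subtracting from a raw number breaks across month and year boundaries, whereas subtracting from a SAS date always steps back by calendar days.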

* SQL query using GROUP BY;

proc sql;
  create table want as
  select 
    event.date
  , volume_one.volume
  , mean(volume_two.volume) as avg
  from event
  left join vol as volume_one
  on event.date = volume_one.date
  left join vol as volume_two
  on volume_two.date between event.date-3 and event.date-1
  group by 
  event.date, volume_one.volume
  ;

* Alternative query using a correlated subquery;

  create table want_2 as
  select 
    event.date
  , volume
  , ( select mean(volume) from vol where vol.date between event.date-3 and event.date-1 )
    as avg
  from event
  left join vol
  on event.date = vol.date
  ;
quit;

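For cases where the volume data has date gaps, a better solution is to compute a rolling average of the N prior volumes separately. Date gaps can come from weekends, holidays, or dates that are missing because of data-entry problems or operator error. Conceptually, for averaging, the only role of date is to order the data.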

Once the rolling averages are computed, a simple join or merge can be done.

Example:

* Simulate some volume data that excludes weekends, holidays, and a 2% rate of missing dates;

data volumes(keep=date volume);
  call streaminit(20200502);
  do date = '01jan1970'd to today();
    length holiday $ 20;
    year = year(date);
    holiday = 'NEWYEAR';         hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'USINDEPENDENCE';  hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'THANKSGIVING';    hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'CHRISTMAS';       hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'MEMORIAL';        hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'LABOR';           hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'EASTER';          hdate = holiday(holiday, year); if date=hdate then continue;
    holiday = 'USPRESIDENTS';    hdate = holiday(holiday, year); if date=hdate then continue;

    if weekday(date) in (1,7) then continue; *1=Sun, 7=Sat;

    volume = 100 + ceil(75 * sin (date / 8));

    if rand('uniform') < 0.02 then continue;

    output;
  end;
  format date yymmdd10.;
run;

* Compute an N item rolling average from N prior values;

%let ROLLING_N = 5;

data volume_averages;

  set volumes;
  by date;      * enforce sort order requirement;

  array v[0:%eval(&ROLLING_N - 1)] _temporary_;   %* <---- &ROLLING_N slots, indexed 0 to N-1 ;
  retain index -1;

  avg_prior_&ROLLING_N. = mean (of v(*));     %* <---- &ROLLING_N ;

  OUTPUT;

  index = mod(index + 1,&ROLLING_N);     %* <---- Modular arithmetic, the foundation of rolling ;
  v[index] = volume;

  format v: 6.;
  drop index;
run;

* merge;

data want_merge;
  merge event(in=event_date) volume_averages;
  by date;

  if event_date;
run;

* join;

proc sql;
  create table want_join as
  select event.*, volume_averages.avg_prior_5
  from event join volume_averages
  on event.date = volume_averages.date;
quit;
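As a sanity check, the same rolling technique can be applied to the six-row vol dataset from the question with a window of 3. This is a sketch under the assumption that vol is the dataset defined earlier (with either raw or SAS dates, since only the ordering matters); N, vol_sorted, vol_avg, and avg_prior are names introduced here for illustration.

```sas
%let N = 3;   /* illustrative window size */

/* The rolling step needs the rows in ascending date order */
proc sort data = vol out = vol_sorted;
  by date;
run;

data vol_avg;
  set vol_sorted;
  by date;

  array v[0:%eval(&N - 1)] _temporary_;   /* circular buffer of the N prior volumes */
  retain index -1;

  avg_prior = mean(of v[*]);   /* average of prior values only; missing on the first row */
  output;

  index = mod(index + 1, &N);  /* advance the buffer position, wrapping at N */
  v[index] = volume;           /* store the current volume for later rows    */

  drop index;
run;
```

For 20200428 the buffer holds 90, 80, and 95, giving 88.33, and for 20200429 it holds 80, 95, and 86, giving 87.00, which matches the values expected in the question.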