Creating variables that count the "levels" of other variables

I have a dataset that looks like the simplified table below (call it "DS_have"):

SurveyID    Participant FavoriteColor   FavoriteFood    SurveyMonth
S101        G92         Blue            Pizza           Jan
S102        B34         Blue            Cake            Feb
S103        Z28         Green           Cake            Feb
S104        V11         Red             Cake            Feb
S105        P03         Yellow          Pizza           Mar
S106        A71         Red             Pizza           Mar
S107        C48         Green           Cake            Mar
S108        G92         Blue            Cake            Apr
...
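
For anyone who wants to reproduce this, the rows shown above could be read in with a data step along these lines. This is only a sketch: the character lengths are assumptions sized to the sample values, and the real data has many more rows and levels.

```sas
*build the sample rows shown above (lengths are assumptions based on the sample only);
data DS_have;
  length SurveyID $4 Participant $3 FavoriteColor $6 FavoriteFood $5 SurveyMonth $3;
  input SurveyID Participant FavoriteColor FavoriteFood SurveyMonth;
  datalines;
S101 G92 Blue Pizza Jan
S102 B34 Blue Cake Feb
S103 Z28 Green Cake Feb
S104 V11 Red Cake Feb
S105 P03 Yellow Pizza Mar
S106 A71 Red Pizza Mar
S107 C48 Green Cake Mar
S108 G92 Blue Cake Apr
;
run;
```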

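I want to create a set of numeric variables that identify the discrete categories/levels of each variable in the dataset above. The result should look like the following dataset ("DS_want"):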

SurveyID    Participant FavoriteColor   FavoriteFood    SurveyMonth ColorLevels FoodLevels  ParticipantLevels   MonthLevels
S101        G92        Blue             Pizza           Jan                   1          1                  1             1
S102        B34        Blue             Cake            Feb                   1          2                  2             2
S103        Z28        Green            Cake            Feb                   2          2                  3             2
S104        V11        Red              Cake            Feb                   3          2                  4             2
S105        P03        Yellow           Pizza           Mar                   4          1                  5             3
S106        A71        Red              Pizza           Mar                   3          1                  6             3
S107        C48        Green            Cake            Mar                   2          2                  7             3
S108        G92        Blue             Cake            Apr                   1          2                  1             4
...

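Essentially, I want to know what syntax to use to generate a unique numeric value for each "level" or category of every variable in the DS_have dataset. Note that I cannot use conditional if/then statements to create the values of the "Levels" variables for each category, because some of the variables have thousands of levels.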

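One straightforward solution is to use proc tabulate to generate a list of the unique values, then iterate over that list and build an informat that converts the text to numbers; after that you just use input to code each variable.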

*store variables you want to work with in a macro variable to make this easier;
%let vars=FavoriteColor FavoriteFood SurveyMonth;

*run a tabulate to get the unique values;
proc tabulate data=have out=freqs;
  class &vars.;
  tables (&vars.),n;
run;

*if you prefer to have this in a particular order, sort by that now - otherwise you may have odd results (as this will).  Sort by _TYPE_ then your desired order.;


*Now create a dataset to read in for informat.;
data for_fmt;
  if 0 then set freqs;
  array vars &vars.;
  retain type 'I';  *I = numeric informat;
  do label = 1 by 1 until (last._type_);  *for each _type_, start with 1 and increment by 1;
    set freqs;
    by _type_ notsorted;
    which_var = find(_type_,'1');  *parses the '100' value from TYPE to see which variable this row is doing something to.  May not work if many variables - need another solution to identify which (depends on your data what works);

    start = coalescec(vars[which_var]);
    fmtname = cats(vname(vars[which_var]),'I');
    output;
    if first._type_ then do; *set up what to do if you encounter a new value not coded - set it to missing;
      hlo='O';  *this means OTHER;
      start=' ';
      label=.;
      output;
      hlo=' ';
      label=1;
    end;
  end;
run;

proc format cntlin=for_fmt;  *import to format catalog via PROC FORMAT;
run;
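
To check what the CNTLIN step actually built, it may help to print the informat definitions. The sketch below assumes the informats were stored in the default WORK catalog under the names generated above (variable name plus a trailing I); in a SELECT statement, informat names are prefixed with @.

```sas
*list the generated informats to verify the text-to-number mapping;
proc format fmtlib;
  select @FavoriteColorI @FavoriteFoodI @SurveyMonthI;
run;
```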

Then code the variables like this (you could write a macro to loop over the &vars macro variable).

data want;
  set have;
  color_code = input(FavoriteColor,FavoriteColorI.);
run;
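
The macro loop mentioned above might look something like the following sketch. The macro name apply_informats, its parameters, and the _code suffix on the new variables are all assumptions for illustration; the only thing taken from the code above is that each informat is named after its variable plus a trailing I.

```sas
%macro apply_informats(data=, out=, vars=);
  %local i var;
  data &out;
    set &data;
    %do i = 1 %to %sysfunc(countw(&vars, %str( )));
      %let var = %scan(&vars, &i, %str( ));
      /* e.g. FavoriteColor_code = input(FavoriteColor, FavoriteColorI.); */
      &var._code = input(&var, &var.I.);
    %end;
  run;
%mend apply_informats;

%apply_informats(data=have, out=want, vars=&vars);
```

Because the informat maps unknown values to missing through the OTHER row, any value that was not in the data when the informat was built would come out as missing rather than raising an error.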

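Another approach: create a hash object for each variable to keep track of the levels encountered, and read the dataset twice with a double DOW loop, applying the level numbers on the second pass. It may not be as elegant as Joe's solution, but it should use less memory, and I suspect it will scale better to more variables.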

%macro levels_rename(DATA,OUT,VARS,NEWVARS);
    %local i NUMVARS VARNAME;

    data &OUT;
    if 0 then set &DATA;
    length LEVEL 8;
    %let i = 1;
    %let VARNAME = %scan(&VARS,&i);
    %do %while(&VARNAME ne );
        declare hash h&i();
        rc = h&i..definekey("&VARNAME");
        rc = h&i..definedata("LEVEL");
        rc = h&i..definedone();
      %let i = %eval(&i + 1);
      %let VARNAME = %scan(&VARS,&i);
    %end;
    %let NUMVARS = %eval(&i - 1);
    do _n_ = 1 by 1 until(eof);
        set &DATA end = eof;
      %do i = 1 %to &NUMVARS;
        LEVEL = h&i..num_items + 1;
        rc = h&i..add();
      %end;
    end;
    do _n_ = 1 to _n_;
      set &DATA;
      %do i = 1 %to &NUMVARS;
        rc = h&i..find();
        %scan(&NEWVARS,&i) = LEVEL;
      %end;
      output;
    end;
    drop LEVEL rc;  /* also drop the hash return code so it does not end up in the output */
    run;
%mend;

%levels_rename(sashelp.class,class_renamed,NAME SEX, NAME_L SEX_L);
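
Applied to the question's data, the call might look like this. It is a sketch that assumes the sample has been loaded into a dataset named DS_have. Because each hash assigns num_items + 1 the first time a key is added, the levels are numbered in order of first appearance in the data, not alphabetically.

```sas
%levels_rename(DS_have, DS_want,
               FavoriteColor FavoriteFood Participant SurveyMonth,
               ColorLevels FoodLevels ParticipantLevels MonthLevels);

*inspect the result;
proc print data=DS_want noobs;
run;
```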