基于多个事件导出属性
Derive attributes based on multiple events
我有数据想要转置,以便在任何时间点可视化单个 ID 的状态。
我一直在尝试遵循来自 的@Joe 的回答,但我在处理多种模式属性的情况时遇到了困难。
这是我拥有的基于事件的数据:
data have;
infile datalines delimiter="|";
input attrib :. multiple_attr :. id :. attrib_id :8. member_value :0. type :. dt_event :datetime18.;
format dt_event datetime20.;
datalines;
TYPE|N|ABC123|111|MEDIUM|Start|01DEC2014:00:00:00
TYPE|N|ABC123|111|MEDIUM|End|18APR2021:00:00:00
TYPE|N|ABC123|111|BIG|Start|19APR2021:00:00:00
TYPE|N|ABC123|111|BIG|End|31DEC2030:00:00:00
POSITION|N|ABC123|222|TOP|Start|01DEC2014:00:00:00
POSITION|N|ABC123|222|TOP|End|31DEC2030:00:00:00
IS_ACTIVE|N|ABC123|333|YES|Start|01DEC2014:00:00:00
IS_ACTIVE|N|ABC123|333|YES|End|31DEC2030:00:00:00
LEVELS|Y|ABC123|1|ALONE|Start|01DEC2014:00:00:00
LEVELS|Y|ABC123|1|BOTH|Start|01DEC2014:00:00:00
LEVELS|Y|ABC123|1|BOTH|End|18APR2021:00:00:00
LEVELS|Y|ABC123|1|ALONE|End|31DEC2030:00:00:00
TYPE|N|DEF456|111|MEDIUM|Start|01DEC2014:00:00:00
TYPE|N|DEF456|111|MEDIUM|End|31DEC2030:00:00:00
POSITION|N|DEF456|222|MID|Start|01DEC2014:00:00:00
POSITION|N|DEF456|222|MID|End|31DEC2030:00:00:00
IS_ACTIVE|N|DEF456|333|YES|Start|01MAR2014:00:00:00
IS_ACTIVE|N|DEF456|333|YES|End|31DEC2030:00:00:00
LEVELS|Y|DEF456|1|ALONE|Start|01MAR2014:00:00:00
LEVELS|Y|DEF456|1|BOTH|Start|01MAR2014:00:00:00
LEVELS|Y|DEF456|1|BOTH|End|31MAR2018:00:00:00
LEVELS|Y|DEF456|1|BOTH|Start|20AUG2018:00:00:00
LEVELS|Y|DEF456|1|ALONE|End|31DEC2030:00:00:00
LEVELS|Y|DEF456|1|BOTH|End|31DEC2030:00:00:00
;
使用@Joe 的方法:
proc sort data=have;
by id attrib_id dt_event member_value;
run;
data want;
set have(rename=member_value=in_value);
by id attrib_id dt_event;
retain start_date end_date member_value orig_value;
format member_value new_value 0.;
* First row per attrib_id is easy, just start it off with a START;
if first.attrib_id then do;
start_date = dt_event;
member_value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current member_value from the concatenated value string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = member_value;
do _i = 1 to countw(member_value,';');
if scan(orig_value,_i,';') ne in_value then do;
if orig_value > scan(orig_value,_i,';') then new_value = catx('; ',scan(orig_value,_i,';'),new_value);
else new_value = catx('; ',new_value,scan(orig_value,_i,';'));
end;
end;
orig_value = new_value;
if last.dt_event then do;
end_date = dt_event;
output;
start_date = dt_event + 86400;
member_value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end_date = dt_event - 86400;
if start_date < end_date and not missing(member_value) then output;
if member_value > in_value then member_value = catx('; ',in_value,member_value);
else member_value = catx('; ',member_value,in_value);
start_date = dt_event;
end_date = .;
end;
end;
format start_date end_date datetime20.;
keep id multiple_attr attrib_id member_value start_date end_date;
run;
我最终得到:
+---------------+--------+-----------+--------------------+--------------------+-------------------+
| multiple_attr | id | attrib_id | start_date | end_date | member_value |
+---------------+--------+-----------+--------------------+--------------------+-------------------+
| Y | ABC123 | 1 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | ALONE; BOTH |
| Y | ABC123 | 1 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BOTH; ALONE |
| N | ABC123 | 111 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | MEDIUM |
| N | ABC123 | 111 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BIG |
| N | ABC123 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | TOP |
| N | ABC123 | 333 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | YES |
| Y | DEF456 | 1 | 01MAR2014:00:00:00 | 31MAR2018:00:00:00 | ALONE; BOTH |
| Y | DEF456 | 1 | 01APR2018:00:00:00 | 19AUG2018:00:00:00 | BOTH; ALONE |
| Y | DEF456 | 1 | 20AUG2018:00:00:00 | 31DEC2030:00:00:00 | BOTH; BOTH; ALONE |
| N | DEF456 | 111 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MEDIUM |
| N | DEF456 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MID |
| N | DEF456 | 333 | 01MAR2014:00:00:00 | 31DEC2030:00:00:00 | YES |
+---------------+--------+-----------+--------------------+--------------------+-------------------+
您可以看到多个模态属性 (where multiple_attr = "Y"
) 没有正确处理。
期望的输出应该是这样的:
+---------------+--------+-----------+--------------------+--------------------+--------------+
| multiple_attr | id | attrib_id | start_date | end_date | member_value |
+---------------+--------+-----------+--------------------+--------------------+--------------+
| Y | ABC123 | 1 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | ALONE; BOTH |
| Y | ABC123 | 1 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | ALONE |
| N | ABC123 | 111 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | MEDIUM |
| N | ABC123 | 111 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BIG |
| N | ABC123 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | TOP |
| N | ABC123 | 333 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | YES |
| Y | DEF456 | 1 | 01MAR2014:00:00:00 | 31MAR2018:00:00:00 | ALONE; BOTH |
| Y | DEF456 | 1 | 01APR2018:00:00:00 | 19AUG2018:00:00:00 | ALONE |
| Y | DEF456 | 1 | 20AUG2018:00:00:00 | 31DEC2030:00:00:00 | ALONE; BOTH |
| N | DEF456 | 111 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MEDIUM |
| N | DEF456 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MID |
| N | DEF456 | 333 | 01MAR2014:00:00:00 | 31DEC2030:00:00:00 | YES |
+---------------+--------+-----------+--------------------+--------------------+--------------+
有没有办法处理多种模态属性?一旦该属性的模式结束(即在结束后从 ALONE; BOTH
切换到 ALONE
),我找不到 delete 成员值的方法。
不是 100% 确定我理解全部,但我认为至少这是一个问题。
查看删除值的位置,由于空格,您需要使用 strip
或类似名称。我删除了 catx()
中的空格并在此处添加 strip()
。
if strip(scan(orig_value,_i,';')) ne strip(in_value) then do;
if strip(orig_value) > strip(scan(orig_value,_i,';')) then new_value = catx(';',scan(orig_value,_i,';'),new_value);
else new_value = catx(';',new_value,scan(orig_value,_i,';'));
end;
否则它会将带空格的词与不带空格的词进行比较,虽然在某些情况下这些词是相同的(或被 SAS 视为相同),但在某些情况下它们不是,这会导致您在此处出现一些问题.例如,当我 运行 这个时,我在第二行得到“Alone”。
我有数据想要转置,以便在任何时间点可视化单个 ID 的状态。
我一直在尝试遵循来自
这是我拥有的基于事件的数据:
data have;
infile datalines delimiter="|";
input attrib :. multiple_attr :. id :. attrib_id :8. member_value :0. type :. dt_event :datetime18.;
format dt_event datetime20.;
datalines;
TYPE|N|ABC123|111|MEDIUM|Start|01DEC2014:00:00:00
TYPE|N|ABC123|111|MEDIUM|End|18APR2021:00:00:00
TYPE|N|ABC123|111|BIG|Start|19APR2021:00:00:00
TYPE|N|ABC123|111|BIG|End|31DEC2030:00:00:00
POSITION|N|ABC123|222|TOP|Start|01DEC2014:00:00:00
POSITION|N|ABC123|222|TOP|End|31DEC2030:00:00:00
IS_ACTIVE|N|ABC123|333|YES|Start|01DEC2014:00:00:00
IS_ACTIVE|N|ABC123|333|YES|End|31DEC2030:00:00:00
LEVELS|Y|ABC123|1|ALONE|Start|01DEC2014:00:00:00
LEVELS|Y|ABC123|1|BOTH|Start|01DEC2014:00:00:00
LEVELS|Y|ABC123|1|BOTH|End|18APR2021:00:00:00
LEVELS|Y|ABC123|1|ALONE|End|31DEC2030:00:00:00
TYPE|N|DEF456|111|MEDIUM|Start|01DEC2014:00:00:00
TYPE|N|DEF456|111|MEDIUM|End|31DEC2030:00:00:00
POSITION|N|DEF456|222|MID|Start|01DEC2014:00:00:00
POSITION|N|DEF456|222|MID|End|31DEC2030:00:00:00
IS_ACTIVE|N|DEF456|333|YES|Start|01MAR2014:00:00:00
IS_ACTIVE|N|DEF456|333|YES|End|31DEC2030:00:00:00
LEVELS|Y|DEF456|1|ALONE|Start|01MAR2014:00:00:00
LEVELS|Y|DEF456|1|BOTH|Start|01MAR2014:00:00:00
LEVELS|Y|DEF456|1|BOTH|End|31MAR2018:00:00:00
LEVELS|Y|DEF456|1|BOTH|Start|20AUG2018:00:00:00
LEVELS|Y|DEF456|1|ALONE|End|31DEC2030:00:00:00
LEVELS|Y|DEF456|1|BOTH|End|31DEC2030:00:00:00
;
使用@Joe 的方法:
proc sort data=have;
by id attrib_id dt_event member_value;
run;
data want;
set have(rename=member_value=in_value);
by id attrib_id dt_event;
retain start_date end_date member_value orig_value;
format member_value new_value 0.;
* First row per attrib_id is easy, just start it off with a START;
if first.attrib_id then do;
start_date = dt_event;
member_value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current member_value from the concatenated value string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = member_value;
do _i = 1 to countw(member_value,';');
if scan(orig_value,_i,';') ne in_value then do;
if orig_value > scan(orig_value,_i,';') then new_value = catx('; ',scan(orig_value,_i,';'),new_value);
else new_value = catx('; ',new_value,scan(orig_value,_i,';'));
end;
end;
orig_value = new_value;
if last.dt_event then do;
end_date = dt_event;
output;
start_date = dt_event + 86400;
member_value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end_date = dt_event - 86400;
if start_date < end_date and not missing(member_value) then output;
if member_value > in_value then member_value = catx('; ',in_value,member_value);
else member_value = catx('; ',member_value,in_value);
start_date = dt_event;
end_date = .;
end;
end;
format start_date end_date datetime20.;
keep id multiple_attr attrib_id member_value start_date end_date;
run;
我最终得到:
+---------------+--------+-----------+--------------------+--------------------+-------------------+
| multiple_attr | id | attrib_id | start_date | end_date | member_value |
+---------------+--------+-----------+--------------------+--------------------+-------------------+
| Y | ABC123 | 1 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | ALONE; BOTH |
| Y | ABC123 | 1 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BOTH; ALONE |
| N | ABC123 | 111 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | MEDIUM |
| N | ABC123 | 111 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BIG |
| N | ABC123 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | TOP |
| N | ABC123 | 333 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | YES |
| Y | DEF456 | 1 | 01MAR2014:00:00:00 | 31MAR2018:00:00:00 | ALONE; BOTH |
| Y | DEF456 | 1 | 01APR2018:00:00:00 | 19AUG2018:00:00:00 | BOTH; ALONE |
| Y | DEF456 | 1 | 20AUG2018:00:00:00 | 31DEC2030:00:00:00 | BOTH; BOTH; ALONE |
| N | DEF456 | 111 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MEDIUM |
| N | DEF456 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MID |
| N | DEF456 | 333 | 01MAR2014:00:00:00 | 31DEC2030:00:00:00 | YES |
+---------------+--------+-----------+--------------------+--------------------+-------------------+
您可以看到多个模态属性 (where multiple_attr = "Y"
) 没有正确处理。
期望的输出应该是这样的:
+---------------+--------+-----------+--------------------+--------------------+--------------+
| multiple_attr | id | attrib_id | start_date | end_date | member_value |
+---------------+--------+-----------+--------------------+--------------------+--------------+
| Y | ABC123 | 1 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | ALONE; BOTH |
| Y | ABC123 | 1 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | ALONE |
| N | ABC123 | 111 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | MEDIUM |
| N | ABC123 | 111 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BIG |
| N | ABC123 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | TOP |
| N | ABC123 | 333 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | YES |
| Y | DEF456 | 1 | 01MAR2014:00:00:00 | 31MAR2018:00:00:00 | ALONE; BOTH |
| Y | DEF456 | 1 | 01APR2018:00:00:00 | 19AUG2018:00:00:00 | ALONE |
| Y | DEF456 | 1 | 20AUG2018:00:00:00 | 31DEC2030:00:00:00 | ALONE; BOTH |
| N | DEF456 | 111 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MEDIUM |
| N | DEF456 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MID |
| N | DEF456 | 333 | 01MAR2014:00:00:00 | 31DEC2030:00:00:00 | YES |
+---------------+--------+-----------+--------------------+--------------------+--------------+
有没有办法处理多种模态属性?一旦该属性的模式结束(即在结束后从 ALONE; BOTH
切换到 ALONE
),我找不到 delete 成员值的方法。
不是 100% 确定我理解全部,但我认为至少这是一个问题。
查看删除值的位置,由于空格,您需要使用 strip
或类似名称。我删除了 catx()
中的空格并在此处添加 strip()
。
if strip(scan(orig_value,_i,';')) ne strip(in_value) then do;
if strip(orig_value) > strip(scan(orig_value,_i,';')) then new_value = catx(';',scan(orig_value,_i,';'),new_value);
else new_value = catx(';',new_value,scan(orig_value,_i,';'));
end;
否则它会将带空格的词与不带空格的词进行比较,虽然在某些情况下这些词是相同的(或被 SAS 视为相同),但在某些情况下它们不是,这会导致您在此处出现一些问题.例如,当我 运行 这个时,我在第二行得到“Alone”。