
calculating durations across variables

My data in Stata look like this:

day1    day1_dt     day2    day2_dt     day3    day3_dt     day4    day4_dt     
0       2010-01-02  4       2010-01-03  .       2010-01-04  2       2010-01-05  
.       2011-05-02  3       2011-05-03  4       2011-05-04  4       2011-05-05  
5       2012-01-05  4       2012-01-06  4       2012-01-07  4       2012-01-08  
4       2015-05-02  4       2015-05-03  4       2015-05-04  4       2015-05-05  
1       2009-05-02  4       2009-05-03  0       2009-05-04  4       2009-05-05  


  1. The duration, in number of days, for which the dayX variables hold the value 4.


generate int flg1 =1 if day1 == 4
generate int flg2 =1 if day2 == 4
generate int flg3 =1 if day3 == 4
generate int flg4 =1 if day4 == 4

egen duration = rowtotal(flg*)

  2. Find the date at which the value 4 no longer occurs or has changed, recorded in end_date.


day1    day1_dt     day2    day2_dt     day3    day3_dt     day4    day4_dt     duration    end_date
0       2010-01-02  4       2010-01-03  .       2010-01-04  2       2010-01-05  1           2010-01-04          
.       2011-05-02  3       2011-05-03  4       2011-05-04  4       2011-05-05  2           .
5       2012-01-05  4       2012-01-06  4       2012-01-07  4       2012-01-08  3           .
4       2015-05-02  4       2015-05-03  4       2015-05-04  4       2015-05-05  4           .
1       2009-05-02  4       2009-05-03  0       2009-05-04  4       2009-05-05  2           .

There seems to be a typo in the second row of your last example. If not, please explain why you want duration to be 1 rather than 2.


// This is the best-practice way of sharing data examples in Stata on Stack Overflow

* Example generated by -dataex-. For more info, type help dataex
clear
input byte day1 int day1_dt byte day2 int day2_dt byte day3 int day3_dt byte day4 int day4_dt
0 18264 4 18265 . 18266 2 18267
. 18749 3 18750 4 18751 4 18752
5 18997 4 18998 4 18999 4 19000
4 20210 4 20211 4 20212 4 20213
1 18019 4 18020 0 18021 4 18022
end
format %tdnn/dd/CCYY day1_dt
format %tdnn/dd/CCYY day2_dt
format %tdnn/dd/CCYY day3_dt
format %tdnn/dd/CCYY day4_dt

// This is your solution

* Count number of day1, day2 etc vars with value 4
egen duration = anycount(day?), values(4)
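
As a quick sanity check, the sketch below compares anycount() with the flag-and-rowtotal approach from the question. The variable names flg1 to flg4 and duration2 are illustrative only, and the date variables are left out because they are not needed for the count.

```stata
clear
input byte(day1 day2 day3 day4)
0 4 . 2
. 3 4 4
5 4 4 4
4 4 4 4
1 4 0 4
end

* count the day1-day4 variables equal to 4; day? does not match day1_dt etc.
egen duration = anycount(day?), values(4)

* the question's approach: one 0/1 flag per variable, then a row total
forval j = 1/4 {
    gen byte flg`j' = day`j' == 4
}
egen duration2 = rowtotal(flg?)

* both routes should agree on every row
assert duration == duration2
list day? duration
```

The single-character wildcard ? is what keeps the date variables out of the count, since their names are longer than four characters.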

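From a Stata point of view, you appear to be holding panel or longitudinal data in a wide layout. As you have found, that makes even simple tasks quite awkward. I recommend using reshape.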


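See the Stata tag wiki for how to present data examples as Stata code (short version: use the dataex command). Your example is clear enough, but your date type has to be guessed, because the display format could be YMD or YDM. I guessed one way, but the principles are the same the other way. If your date variables are really strings, you will need to push them through daily() to do anything useful.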
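
If the dates do turn out to be strings, a conversion could look like the sketch below. It assumes year-month-day order, and the names day1_str and day2_str are hypothetical stand-ins for whatever the string variables are actually called.

```stata
clear
input str10(day1_str day2_str)
"2010-01-02" "2010-01-03"
"2011-05-02" "2011-05-03"
end

* convert each string date to a numeric daily date (assumed order: YMD)
foreach v in day1 day2 {
    gen long `v'_dt = daily(`v'_str, "YMD")
    format `v'_dt %tdCY-N-D
}

list
```

For the other ordering, only the second argument of daily() would change, for example "YDM".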

The script and its output follow. You will also need to assign a display format to enddate.

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end
format %tdCY-N-D day1_dt
format %tdCY-N-D day2_dt
format %tdCY-N-D day3_dt 
format %tdCY-N-D day4_dt

gen long id = _n 
reshape long day day@_dt, i(id)

egen duration = total(day == 4), by(id)

egen enddate = max(cond(day == 4, day_dt, .)), by(id)
egen whenlast = max(day_dt), by(id)
replace enddate = . if enddate == whenlast 

list, sepby(id)

. clear

. input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)

         day1      day2      day3      day4    day1_dt    day2_dt    day3_dt    day4_dt
  1. 0 4 . 2 18264 18265 18266 18267
  2. . 3 4 4 18749 18750 18751 18752
  3. 5 4 4 4 18997 18998 18999 19000
  4. 4 4 4 4 20210 20211 20212 20213
  5. 1 4 0 4 18019 18020 18021 18022
  6. end

. format %tdCY-N-D day1_dt

. format %tdCY-N-D day2_dt

. format %tdCY-N-D day3_dt 

. format %tdCY-N-D day4_dt

. gen long id = _n 

. reshape long day day@_dt, i(id)
(j = 1 2 3 4)

Data                               Wide   ->   Long
Number of observations                5   ->   20          
Number of variables                   9   ->   4           
j variable (4 values)                     ->   _j
xij variables:
                     day1 day2 ... day4   ->   day
            day1_dt day2_dt ... day4_dt   ->   day_dt

. egen duration = total(day == 4), by(id)

. egen enddate = max(cond(day == 4, day_dt, .)), by(id)

. egen whenlast = max(day_dt), by(id)

. replace enddate = . if enddate == whenlast 
(16 real changes made, 16 to missing)

. list, sepby(id)

     | id   _j   day       day_dt   duration   enddate   whenlast |
  1. |  1    1     0   2010-01-02          1     18265      18267 |
  2. |  1    2     4   2010-01-03          1     18265      18267 |
  3. |  1    3     .   2010-01-04          1     18265      18267 |
  4. |  1    4     2   2010-01-05          1     18265      18267 |
  5. |  2    1     .   2011-05-02          2         .      18752 |
  6. |  2    2     3   2011-05-03          2         .      18752 |
  7. |  2    3     4   2011-05-04          2         .      18752 |
  8. |  2    4     4   2011-05-05          2         .      18752 |
  9. |  3    1     5   2012-01-05          3         .      19000 |
 10. |  3    2     4   2012-01-06          3         .      19000 |
 11. |  3    3     4   2012-01-07          3         .      19000 |
 12. |  3    4     4   2012-01-08          3         .      19000 |
 13. |  4    1     4   2015-05-02          4         .      20213 |
 14. |  4    2     4   2015-05-03          4         .      20213 |
 15. |  4    3     4   2015-05-04          4         .      20213 |
 16. |  4    4     4   2015-05-05          4         .      20213 |
 17. |  5    1     1   2009-05-02          2         .      18022 |
 18. |  5    2     4   2009-05-03          2         .      18022 |
 19. |  5    3     0   2009-05-04          2         .      18022 |
 20. |  5    4     4   2009-05-05          2         .      18022 |
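
In the listing above, enddate and whenlast appear as raw day counts because they have no display format yet. A minimal sketch of the same script with a format assigned follows; the variable tag and the one-row-per-id listing are additions for illustration, not part of the original script.

```stata
clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end

gen long id = _n
reshape long day day@_dt, i(id)

egen duration = total(day == 4), by(id)
egen enddate = max(cond(day == 4, day_dt, .)), by(id)
egen whenlast = max(day_dt), by(id)
replace enddate = . if enddate == whenlast

* give the new date variables a daily date display format
format enddate whenlast %tdCY-N-D

* the results are constant within id, so one row per id is enough
egen byte tag = tag(id)
list id duration enddate whenlast if tag, noobs
```

Any other daily date format, such as plain %td, would serve equally well here.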

This is a sequel to @TheIceBear's answer, showing how to answer question 2 while keeping the same layout.

clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end
format %tdCY-N-D day1_dt
format %tdCY-N-D day2_dt
format %tdCY-N-D day3_dt 
format %tdCY-N-D day4_dt

gen enddate = . 

* 1/4 is contingent on day*_dt running over 1 to 4 
* and on those variables being in date order 
forval j = 1/4 { 
    replace enddate = day`j'_dt if day`j' == 4 
}

egen whenlast = rowmax(day*_dt) 

replace enddate = . if enddate == whenlast 

format enddate whenlast %td 

list enddate whenlast
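
The loop relies on the date variables being in date order, as the comments note. One way to test that assumption before running the loop is sketched below; the check is an illustrative addition, and it skips rows in which either date of a pair is missing.

```stata
clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end

* each date variable should be later than the one before it;
* assert stops with an error if any row breaks the ordering
forval j = 2/4 {
    local i = `j' - 1
    assert day`i'_dt < day`j'_dt if !missing(day`i'_dt, day`j'_dt)
}
```

If an assert fails, the wide-layout loop would not be safe to use as written, and the reshape long approach would be the more robust route.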