expand and duplicates

I have a dataset as follows.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 id str8 drug1 str3(drug2 drug3)
"pat"  "thiazide" "BB"  "CCB"
"ann"  "thiazide" "ace" ""   
"mary" "ace"      ""    ""   
"john" "ace"      ""    ""   
end

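I want to create a separate row for each person for each drug they have. reshape is definitely not what I want: I have been experimenting with expand and think it is the solution, except for a few small things I cannot get right. I think I need to expand and then remove duplicates.

Step 1:

This is the code I used to get what I want. It works fine except for pat: his third drug is not copied to his third row.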

expand 3
by id, sort: generate drug = cond(_n == 1,drug1, drug2, drug3)

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 id str8 drug1 str3(drug2 drug3) str8 drug
"ann"  "thiazide" "ace" ""    "thiazide"
"ann"  "thiazide" "ace" ""    "ace"     
"ann"  "thiazide" "ace" ""    "ace"     
"john" "ace"      ""    ""    "ace"     
"john" "ace"      ""    ""    ""        
"john" "ace"      ""    ""    ""        
"mary" "ace"      ""    ""    "ace"     
"mary" "ace"      ""    ""    ""        
"mary" "ace"      ""    ""    ""        
"pat"  "thiazide" "BB"  "CCB" "thiazide"
"pat"  "thiazide" "BB"  "CCB" "BB"      
"pat"  "thiazide" "BB"  "CCB" "BB"      
end
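A note on why pat's third row goes wrong: in cond(x, a, b, c) the fourth argument is returned only when x is missing, and _n == 1 is never missing, so drug3 is never used. A minimal sketch of one possible fix, assuming the Step 1 toy data, is to nest a second cond():

```stata
clear
input str4 id str8 drug1 str3(drug2 drug3)
"pat"  "thiazide" "BB"  "CCB"
"ann"  "thiazide" "ace" ""
"mary" "ace"      ""    ""
"john" "ace"      ""    ""
end

expand 3
* row 1 takes drug1, row 2 takes drug2, row 3 takes drug3
by id, sort: generate drug = cond(_n == 1, drug1, cond(_n == 2, drug2, drug3))
* rows that picked up an empty drug slot are not needed
drop if missing(drug)
list id drug, sepby(id)
```

With the nesting in place, the empty rows are the only ones left to drop, so no separate duplicate-removal step should be needed for these data.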

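If anyone could guide me on how to fix this, that would be great.

Step 2: For the second step (assuming pat's rows are correct for this), I want to remove duplicates so that I am left with the right number of rows given each person's number of distinct drugs. For example, none of pat's rows should be duplicates, so I want to keep all of his rows. But ann has one duplicate row that I need to drop.

This is what I have used: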

bys id drug: gen dup2 = cond(_N == 1, 0, _n)
drop if dup2>1

This works, but I am also left with extra rows for mary and john. I deal with these using:

drop if drug==""

Is this the most efficient / least error-prone way to do it?
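For comparison, here is a sketch of the same Step 2 done with the built-in duplicates command instead of a hand-made dup2 counter. It starts from the expanded data shown in Step 1 (so pat's CCB row is still absent) and assumes that a person never legitimately has the same drug recorded twice:

```stata
clear
input str4 id str8 drug1 str3(drug2 drug3) str8 drug
"ann"  "thiazide" "ace" ""    "thiazide"
"ann"  "thiazide" "ace" ""    "ace"
"ann"  "thiazide" "ace" ""    "ace"
"john" "ace"      ""    ""    "ace"
"john" "ace"      ""    ""    ""
"john" "ace"      ""    ""    ""
"mary" "ace"      ""    ""    "ace"
"mary" "ace"      ""    ""    ""
"mary" "ace"      ""    ""    ""
"pat"  "thiazide" "BB"  "CCB" "thiazide"
"pat"  "thiazide" "BB"  "CCB" "BB"
"pat"  "thiazide" "BB"  "CCB" "BB"
end

* keep one row per id-drug pair; force is required when a varlist is given
duplicates drop id drug, force
* the remaining empty rows carry no drug at all
drop if missing(drug)
list id drug, sepby(id)
```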

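Amendment: It turns out my toy dataset was too simple to reflect my real data. My actual data are already long, which is why reshape will not work here. I am happy to be corrected, but I think expand may be the right choice. However, now that I am trying expand on the more complex data, I cannot work out how to get a loop to produce the dataset I need (basically, one observation per person per drug). Here is an example of what I have: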

clear
input str4 id int day str8 drug1 str3(drug2 drug3)
"ann"   14 "thiazide" "ace" ""   
"ann"   70 "thiazide" "ace" ""   
"ann"    1 "CCB"      ""    ""   
"ann"   35 "thiazide" "ace" ""   
"ann "  30 "CCB"      ""    ""   
"john"   1 "ace"      ""    ""   
"john"  30 "CCB"      ""    ""   
"john" 150 "ace"      ""    ""   
"john"  60 "ace"      ""    ""   
"john"  60 "CCB"      ""    ""   
"john"  30 "ace"      ""    ""   
"john"   1 "CCB"      ""    ""   
"mary"  30 "ace"      ""    ""   
"mary"   1 "ace"      ""    ""   
"mary" 115 "thiazide" ""    ""   
"mary"  60 "ace"      ""    ""   
"mary"  90 "ace"      ""    ""   
"mary" 120 "ace"      ""    ""   
"pat"   30 "thiazide" "BB"  "CCB"
"pat"    1 "ace"      ""    ""   
"pat"   30 "ace"      ""    ""   
"pat"    1 "thiazide" "BB"  "CCB"
end

After using:

expand 3

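this is an example of what I want, but I am not sure how to write the code to get it. I tried a variant of Nick Cox's loop below, but I did not get it right.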

clear
input str4 id int day str8 drug1 str3(drug2 drug3) str8 drug
"ann"    1 "CCB"      ""    ""    "CCB"     
"ann"    1 "CCB"      ""    ""    ""        
"ann"    1 "CCB"      ""    ""    ""        
"ann"   14 "thiazide" "ace" ""    "thiazide"
"ann"   14 "thiazide" "ace" ""    "ace"     
"ann"   14 "thiazide" "ace" ""    ""        
"ann"   35 "thiazide" "ace" ""    "thiazide"
"ann"   35 "thiazide" "ace" ""    "ace"     
"ann"   35 "thiazide" "ace" ""    ""        
"ann"   70 "thiazide" "ace" ""    "thiazide"
"ann"   70 "thiazide" "ace" ""    "ace"     
"ann"   70 "thiazide" "ace" ""    ""        
"ann "  30 "CCB"      ""    ""    "CCB"     
"ann "  30 "CCB"      ""    ""    ""        
"ann "  30 "CCB"      ""    ""    ""        
"john"   1 "CCB"      ""    ""    "CCB"     
"john"   1 "CCB"      ""    ""    ""        
"john"   1 "CCB"      ""    ""    ""        
"john"   1 "ace"      ""    ""    "ace"     
"john"   1 "ace"      ""    ""    ""        
"john"   1 "ace"      ""    ""    ""        
"john"  30 "CCB"      ""    ""    "CCB"     
"john"  30 "CCB"      ""    ""    ""        
"john"  30 "CCB"      ""    ""    ""        
"john"  30 "ace"      ""    ""    "ace"     
"john"  30 "ace"      ""    ""    ""        
"john"  30 "ace"      ""    ""    ""        
"john"  60 "CCB"      ""    ""    "CCB"     
"john"  60 "CCB"      ""    ""    ""        
"john"  60 "CCB"      ""    ""    ""        
"john"  60 "ace"      ""    ""    "ace"     
"john"  60 "ace"      ""    ""    ""        
"john"  60 "ace"      ""    ""    ""        
"john" 150 "ace"      ""    ""    "ace"     
"john" 150 "ace"      ""    ""    ""        
"john" 150 "ace"      ""    ""    ""        
"mary"   1 "ace"      ""    ""    "ace"     
"mary"   1 "ace"      ""    ""    ""        
"mary"   1 "ace"      ""    ""    ""        
"mary"  30 "ace"      ""    ""    "ace"     
"mary"  30 "ace"      ""    ""    ""        
"mary"  30 "ace"      ""    ""    ""        
"mary"  60 "ace"      ""    ""    "ace"     
"mary"  60 "ace"      ""    ""    ""        
"mary"  60 "ace"      ""    ""    ""        
"mary"  90 "ace"      ""    ""    "ace"     
"mary"  90 "ace"      ""    ""    ""        
"mary"  90 "ace"      ""    ""    ""        
"mary" 115 "thiazide" ""    ""    "thiazide"
"mary" 115 "thiazide" ""    ""    ""        
"mary" 115 "thiazide" ""    ""    ""        
"mary" 120 "ace"      ""    ""    "ace"     
"mary" 120 "ace"      ""    ""    ""        
"mary" 120 "ace"      ""    ""    ""        
"pat"    1 "ace"      ""    ""    "ace"     
"pat"    1 "ace"      ""    ""    ""        
"pat"    1 "ace"      ""    ""    ""        
"pat"    1 "thiazide" "BB"  "CCB" "thiazide"
"pat"    1 "thiazide" "BB"  "CCB" "BB"      
"pat"    1 "thiazide" "BB"  "CCB" "CCB"     
"pat"   30 "ace"      ""    ""    "ace"     
"pat"   30 "ace"      ""    ""    ""        
"pat"   30 "ace"      ""    ""    ""        
"pat"   30 "thiazide" "BB"  "CCB" "thiazide"
"pat"   30 "thiazide" "BB"  "CCB" "BB"      
"pat"   30 "thiazide" "BB"  "CCB" "CCB"     
end

At this point I can drop the observations with missing values and clean up the dataset to get the following:

drop if missing(drug)
drop drug?



clear
input str4 id int day str8 drug
"ann"    1 "CCB"     
"ann"   14 "thiazide"
"ann"   14 "ace"     
"ann"   35 "thiazide"
"ann"   35 "ace"     
"ann"   70 "thiazide"
"ann"   70 "ace"     
"ann "  30 "CCB"     
"john"   1 "CCB"     
"john"   1 "ace"     
"john"  30 "CCB"     
"john"  30 "ace"     
"john"  60 "CCB"     
"john"  60 "ace"     
"john" 150 "ace"     
"mary"   1 "ace"     
"mary"  30 "ace"     
"mary"  60 "ace"     
"mary"  90 "ace"     
"mary" 115 "thiazide"
"mary" 120 "ace"     
"pat"    1 "ace"     
"pat"    1 "thiazide"
"pat"    1 "BB"      
"pat"    1 "CCB"     
"pat"   30 "ace"     
"pat"   30 "thiazide"
"pat"   30 "BB"      
"pat"   30 "CCB"     
end

I am puzzled that reshape was dismissed without argument or evidence. reshape takes you straight there, apart from one line to remove the missings.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 id str8 drug1 str3(drug2 drug3)
"pat"  "thiazide" "BB"  "CCB"
"ann"  "thiazide" "ace" ""   
"mary" "ace"      ""    ""   
"john" "ace"      ""    ""   
end

reshape long drug, i(id) j(seq) 
drop if missing(drug) 
list, sepby(id) 

     +-----------------------+
     |   id   seq       drug |
     |-----------------------|
  1. |  ann     1   thiazide |
  2. |  ann     2        ace |
     |-----------------------|
  3. | john     1        ace |
     |-----------------------|
  4. | mary     1        ace |
     |-----------------------|
  5. |  pat     1   thiazide |
  6. |  pat     2         BB |
  7. |  pat     3        CCB |
     +-----------------------+

EDIT:

Your idea of starting with expand can be implemented easily. Behind the scenes, reshape is doing something similar.

clear
input str4 id str8 drug1 str3(drug2 drug3)
"pat"  "thiazide" "BB"  "CCB"
"ann"  "thiazide" "ace" ""   
"mary" "ace"      ""    ""   
"john" "ace"      ""    ""   
end
expand 3 
sort id 
gen drug = "" 
quietly forval j = 1/3 { 
     by id: replace drug = drug`j' if _n == `j' 
} 
drop if missing(drug) 
drop drug? 
list, sepby(id) 
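A variation on the same expand idea, offered only as a sketch: count the nonmissing drugs first and expand by that count, so that empty rows are never created. The variable names ndrugs and seq are made up for this illustration, and the code assumes one row per id with drugs filled from drug1 onwards without gaps.

```stata
clear
input str4 id str8 drug1 str3(drug2 drug3)
"pat"  "thiazide" "BB"  "CCB"
"ann"  "thiazide" "ace" ""
"mary" "ace"      ""    ""
"john" "ace"      ""    ""
end

* number of nonmissing drugs per row; strok lets egen count string variables
egen ndrugs = rownonmiss(drug1-drug3), strok
* make exactly as many copies of each row as it has drugs
expand ndrugs
bysort id: gen seq = _n
gen drug = ""
forval j = 1/3 {
    replace drug = drug`j' if seq == `j'
}
* safety net for any row that had no drugs at all
drop if missing(drug)
drop drug? ndrugs
list, sepby(id)
```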

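EDIT 2

The extra complications are just complications and do not imply a different approach. You need more confidence, and to learn that reshape is more versatile than you think! See, for example, the FAQ here, as well as the help and the manual entry.

Trivially, I assume that "ann " is just a typo for "ann". What we then have is not just different days for the same person, but also some combinations of person and day that are repeated in some way. That means spelling out the identifiers more fully; in fact we need an extra variable. The principle that a new identifier variable is sometimes needed to spell out the default order, even if it is arbitrary, is discussed in the FAQ cited. The idea that a "long long" layout is possible is also a standard notion.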

clear
input str4 id int day str8 drug1 str3(drug2 drug3)
"ann"   14 "thiazide" "ace" ""   
"ann"   70 "thiazide" "ace" ""   
"ann"    1 "CCB"      ""    ""   
"ann"   35 "thiazide" "ace" ""   
"ann "  30 "CCB"      ""    ""   
"john"   1 "ace"      ""    ""   
"john"  30 "CCB"      ""    ""   
"john" 150 "ace"      ""    ""   
"john"  60 "ace"      ""    ""   
"john"  60 "CCB"      ""    ""   
"john"  30 "ace"      ""    ""   
"john"   1 "CCB"      ""    ""   
"mary"  30 "ace"      ""    ""   
"mary"   1 "ace"      ""    ""   
"mary" 115 "thiazide" ""    ""   
"mary"  60 "ace"      ""    ""   
"mary"  90 "ace"      ""    ""   
"mary" 120 "ace"      ""    ""   
"pat"   30 "thiazide" "BB"  "CCB"
"pat"    1 "ace"      ""    ""   
"pat"   30 "ace"      ""    ""   
"pat"    1 "thiazide" "BB"  "CCB"
end

replace id = trim(id) 
bysort id day : gen SEQ = _n 
reshape long drug, i(id day SEQ) j(seq) 
drop if missing(drug) 
list, sepby(id) 

     +-----------------------------------+
     |   id   day   SEQ   seq       drug |
     |-----------------------------------|
  1. |  ann     1     1     1        CCB |
  2. |  ann    14     1     1   thiazide |
  3. |  ann    14     1     2        ace |
  4. |  ann    30     1     1        CCB |
  5. |  ann    35     1     1   thiazide |
  6. |  ann    35     1     2        ace |
  7. |  ann    70     1     1   thiazide |
  8. |  ann    70     1     2        ace |
     |-----------------------------------|
  9. | john     1     1     1        ace |
 10. | john     1     2     1        CCB |
 11. | john    30     1     1        ace |
 12. | john    30     2     1        CCB |
 13. | john    60     1     1        ace |
 14. | john    60     2     1        CCB |
 15. | john   150     1     1        ace |
     |-----------------------------------|
 16. | mary     1     1     1        ace |
 17. | mary    30     1     1        ace |
 18. | mary    60     1     1        ace |
 19. | mary    90     1     1        ace |
 20. | mary   115     1     1   thiazide |
 21. | mary   120     1     1        ace |
     |-----------------------------------|
 22. |  pat     1     1     1        ace |
 23. |  pat     1     2     1   thiazide |
 24. |  pat     1     2     2         BB |
 25. |  pat     1     2     3        CCB |
 26. |  pat    30     1     1        ace |
 27. |  pat    30     2     1   thiazide |
 28. |  pat    30     2     2         BB |
 29. |  pat    30     2     3        CCB |
     +-----------------------------------+
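Why the extra identifier SEQ is needed can be checked directly with isid, which exits with an error when the named variables do not uniquely identify the observations. The following is a small sketch on three made-up rows in the style of john's data, not on the full example:

```stata
clear
input str4 id int day str8 drug1
"john"  1 "ace"
"john"  1 "CCB"
"john" 30 "ace"
end

* id and day alone do not identify observations, so isid fails here
capture isid id day
display "return code without SEQ: " _rc

* an arbitrary counter within id and day repairs that
bysort id day: gen SEQ = _n
isid id day SEQ
```

reshape long makes the same kind of check on the variables named in i(), which is why i(id day) alone would be rejected for these data.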

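Here is my attempt for the more complex data. It seems to work fine, but I am happy to be corrected. Or if there is another, better way, please post!

Here are the toy data: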

clear
input str4 id int day str8 drug1 str3(drug2 drug3)
"pat"    1 "thiazide" "BB"  "CCB"
"pat"    1 "ace"      ""    ""   
"pat"   30 "ace"      ""    ""   
"pat"   30 "thiazide" "BB"  "CCB"
"ann"    1 "CCB"      ""    ""   
"ann"   14 "thiazide" "ace" ""   
"ann "  30 "CCB"      ""    ""   
"ann"   35 "thiazide" "ace" ""   
"ann"   70 "thiazide" "ace" ""   
"mary"   1 "ace"      ""    ""   
"mary"  30 "ace"      ""    ""   
"mary"  60 "ace"      ""    ""   
"mary"  90 "ace"      ""    ""   
"mary" 115 "thiazide" ""    ""   
"mary" 120 "ace"      ""    ""   
"john" 150 "ace"      ""    ""   
"john"   1 "CCB"      ""    ""   
"john"   1 "ace"      ""    ""   
"john"  30 "CCB"      ""    ""   
"john"  30 "ace"      ""    ""   
"john"  60 "CCB"      ""    ""   
"john"  60 "ace"      ""    ""   
end

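The code is here:

expand 3
gen drug = ""
sort id day
* one group per original record, assuming id, day and drug1 identify it
egen group = group(id day drug1)
* number the three copies of each record 1, 2, 3
bys id group: gen count = _n

forval j = 1/3 {
    * copy j of each record takes the j-th drug
    bys id group: replace drug = drug`j' if count == `j'
}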


drop if missing(drug)
drop drug? count group

NJC's simplification:

expand 3
gen drug = ""

* by: requires sorted data, and expand adds the copies at the end
sort id day drug1
forval j = 1/3 {
    by id day drug1: replace drug = drug`j' if _n == `j'
}

drop if missing(drug)
drop drug?
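One caveat with grouping on id, day and drug1: if two original records happen to agree on all three, their copies land in the same group and only one set of drugs survives. A sketch of a variant that avoids this by tagging each original record before expanding follows. The variable name obs is invented for this illustration, and the four input rows (including a deliberately repeated ann record) are made up to show the difference.

```stata
clear
input str4 id int day str8 drug1 str3(drug2 drug3)
"pat"  1 "thiazide" "BB"  "CCB"
"pat"  1 "ace"      ""    ""
"ann" 14 "thiazide" "ace" ""
"ann" 14 "thiazide" "ace" ""
end

* tag each original record before making copies
gen long obs = _n
expand 3
sort obs
gen drug = ""
forval j = 1/3 {
    * copy j of each record takes the j-th drug
    by obs: replace drug = drug`j' if _n == `j'
}
drop if missing(drug)
drop drug? obs
list, sepby(id)
```

Whether repeated records should be kept at all is a question about the data, so this is only worth doing if such repeats are meaningful.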