How to reshape using two variables

Suppose I have this data:

group    obs    data    data_A    data_B
1        1      7_a     7_a       
1        2      4_b               4_b  
1        3      1_a     1_a     
2        1      5_b               5_b
3        1                  
4        1      3_b               3_b
4        2      4_b               4_b
4        3      9_a     9_a     
4        4      8_b               8_b   

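data_A and data_B are constructed from data. The rule is that data_A takes the value of data if data ends in a, and data_B takes the value of data if data ends in b; if data is empty, both data_A and data_B remain empty.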
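
The rule above could be sketched in Stata roughly as follows. This is only an illustration, assuming the type is always encoded in the final character of data; the question does not show how data_A and data_B were actually built.

```stata
clear
input group obs str3 data
1 1 "7_a"
1 2 "4_b"
1 3 "1_a"
2 1 "5_b"
3 1 ""
4 1 "3_b"
4 2 "4_b"
4 3 "9_a"
4 4 "8_b"
end

* assumption: the last character of data identifies the type (a or b)
gen data_A = cond(substr(data, -1, 1) == "a", data, "")
gen data_B = cond(substr(data, -1, 1) == "b", data, "")

list, sepby(group)
```

When data is empty, neither comparison is true, so both new variables stay empty, which matches group 3 in the listing.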

I would like to reshape the data as follows:

group    data_A1    data_A2    data_B1    data_B2    data_B3
1        7_a        1_a        4_b                     
2                              5_b              
3                                            
4        9_a                   3_b        4_b        8_b

where the number of columns is determined automatically by the number of values.

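7_a and 9_a are in data_A1 because they are the first instance of an a value in their respective groups. 1_a is in data_A2 because it is the second instance of an a value in its group, and so on.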
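
To make the "first instance, second instance" numbering concrete, here is a minimal sketch that computes the position of each value within its group. The variable names pos_A and pos_B are hypothetical and used only for illustration.

```stata
clear
input group obs str3(data_A data_B)
1 1 "7_a" ""
1 2 ""    "4_b"
1 3 "1_a" ""
2 1 ""    "5_b"
3 1 ""    ""
4 1 ""    "3_b"
4 2 ""    "4_b"
4 3 "9_a" ""
4 4 ""    "8_b"
end

* running count of nonmissing values within each group, ordered by obs;
* pos_A and pos_B are illustrative names, not part of the original data
bysort group (obs): gen pos_A = sum(!missing(data_A)) if !missing(data_A)
bysort group (obs): gen pos_B = sum(!missing(data_B)) if !missing(data_B)

list, sepby(group)
```

The value of pos_A (or pos_B) is the numeric suffix of the wide column in which that observation's value should end up.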

How can I do this?

(I am aware of reshape and can use it in similar situations.)

One way is to use a loop. It is not very elegant, but it works.

clear
set more off

*----- example data -----

input ///
group    obs    str3(data    data_A    data_B)
1        1      7_a     7_a           ""
1        2      4_b       ""        4_b  
1        3      1_a     1_a          ""
2        1      5_b      ""         5_b
3        1       ""        ""       ""
4        1      3_b       ""        3_b
4        2      4_b       ""        4_b
4        3      9_a     9_a          ""
4        4      8_b       ""        8_b   
end

drop data
list, sepby(group)

*----- what you want -----

quietly foreach i in A B {

    bysort group (obs) : gen count_`i' = sum(!missing(data_`i'))
    summarize count_`i', meanonly

    forvalues j = 1/`r(max)' {
        gen data_`i'`j' = ""
        replace data_`i'`j' = data_`i' if count_`i' == `j'
    }

    drop count_`i'
}

drop data_?

collapse (firstnm) data_*, by(group)

list

Another way uses two reshapes and fillin:

clear
set more off

*----- example data -----

input ///
group    obs    str3(data    data_A    data_B)
1        1      7_a     7_a           ""
1        2      4_b       ""        4_b  
1        3      1_a     1_a          ""
2        1      5_b      ""         5_b
3        1       ""        ""       ""
4        1      3_b       ""        3_b
4        2      4_b       ""        4_b
4        3      9_a     9_a          ""
4        4      8_b       ""        8_b   
end

drop data

list, sepby(group)

*----- what you want -----

// first reshape
reshape long data_ , i(group obs) j(j) string

// running count of nonmissing values within each group and j
bysort group j (obs) : gen count = sum(!missing(data_))

// concatenate and rectangularize
gen j2 = j + string(count)
fillin group j2

// keep the last obs per group-j2 (nonmissing values sort last); drop zero-count cells
bysort group j2 (data_) : drop if _n < _N | inlist(j2, "A0", "B0")

// keep necessary variables
keep group j2 data_

// second reshape
reshape wide data_, i(group) j(j2) string

list

I find the loop solution more intuitive.

The data structure you want is unusual. It is always a good idea to include some context and your ultimate goal.

I agree with Roberto that this is a strange thing to want. Here is another fun way to get there:

clear
input float(group obs) str3(data data_A data_B)
1 1 "7_a" "7_a" "" 
1 2 "4_b" "" "4_b" 
1 3 "1_a" "1_a" "" 
2 1 "5_b" "" "5_b" 
3 1 "" "" "" 
4 1 "3_b" "" "3_b" 
4 2 "4_b" "" "4_b" 
4 3 "9_a" "9_a" "" 
4 4 "8_b" "" "8_b" 
end

* verify assumptions about the data
isid group obs, sort

* concatenate values across obs
by group (obs): replace data_A = data_A[_n-1] + " " + data_A
by group (obs): replace data_B = data_B[_n-1] + " " + data_B

* the last obs of the group contains all values
by group: keep if _n == _N

* split each concatenated string
split data_A
split data_B

drop obs data data_A data_B
list