How to reshape using two variables
Suppose I have this data:
group  obs  data  data_A  data_B
1      1    7_a   7_a
1      2    4_b           4_b
1      3    1_a   1_a
2      1    5_b           5_b
3      1
4      1    3_b           3_b
4      2    4_b           4_b
4      3    9_a   9_a
4      4    8_b           8_b
data_A and data_B are built from data. The rule is: if data ends in a, data_A takes the value of data; if data ends in b, data_B takes the value of data; if data is empty, both data_A and data_B stay empty.
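For readers who want to reproduce the setup, the rule above can be sketched as follows. This is only an illustration of the stated rule, assuming the suffix is always the last character of data; it is not necessarily how the original variables were created.

```stata
clear
input group obs str3 data
1 1 "7_a"
1 2 "4_b"
1 3 "1_a"
2 1 "5_b"
3 1 ""
4 1 "3_b"
4 2 "4_b"
4 3 "9_a"
4 4 "8_b"
end

* data_A copies data when it ends in "a", data_B when it ends in "b";
* an empty data leaves both empty (assumption: suffix is the last character)
generate data_A = cond(substr(data, -1, 1) == "a", data, "")
generate data_B = cond(substr(data, -1, 1) == "b", data, "")

list, sepby(group)
```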
I want to reshape the data as follows:
group  data_A1  data_A2  data_B1  data_B2  data_B3
1      7_a      1_a      4_b
2                        5_b
3
4      9_a               3_b      4_b      8_b
where the number of columns is determined automatically by the number of values.
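7_a and 9_a are in data_A1 because they are the first instance of an a variable in their respective groups. 1_a is in data_A2 because it is the second instance of an a variable in its group, and so on.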
How can I do this?
(I am aware of reshape and can use it in similar situations.)
One way is to use a loop. It is not very elegant, but it works.
clear
set more off
*----- example data -----
input ///
group obs str3(data data_A data_B)
1 1 7_a 7_a ""
1 2 4_b "" 4_b
1 3 1_a 1_a ""
2 1 5_b "" 5_b
3 1 "" "" ""
4 1 3_b "" 3_b
4 2 4_b "" 4_b
4 3 9_a 9_a ""
4 4 8_b "" 8_b
end
drop data
list, sepby(group)
*----- what you want -----
quietly foreach i in A B {
    bysort group (obs) : gen count_`i' = sum(!missing(data_`i'))
    summarize count_`i', meanonly
    forvalues j = 1/`r(max)' {
        gen data_`i'`j' = ""
        replace data_`i'`j' = data_`i' if count_`i' == `j'
    }
    drop count_`i'
}
drop data_?
collapse (firstnm) data_*, by(group)
list
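To see why the loop ends with collapse (firstnm), it may help to look at the running count on its own. The sketch below uses only group 1 from the example. Rows with an empty data_A carry the previous count forward, so more than one row can share a count and only one of them holds the value.

```stata
clear
input group obs str3(data_A data_B)
1 1 "7_a" ""
1 2 ""    "4_b"
1 3 "1_a" ""
end

* running count of nonempty data_A values within each group
bysort group (obs) : generate count_A = sum(!missing(data_A))

* count_A is 1, 1, 2: the empty row 2 repeats the count of row 1,
* which is why the first nonmissing value per group is taken afterwards
list group obs data_A count_A
```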
Another approach uses two reshapes and fillin:
clear
set more off
*----- example data -----
input ///
group obs str3(data data_A data_B)
1 1 7_a 7_a ""
1 2 4_b "" 4_b
1 3 1_a 1_a ""
2 1 5_b "" 5_b
3 1 "" "" ""
4 1 3_b "" 3_b
4 2 4_b "" 4_b
4 3 9_a 9_a ""
4 4 8_b "" 8_b
end
drop data
list, sepby(group)
*----- what you want -----
// first reshape
reshape long data_ , i(group obs) j(j) string
// counts per group j
bysort group j (obs) : gen count = sum(!missing(data_))
// concatenate and rectangularize
gen j2 = j + string(count)
fillin group j2
// drop some observations
bysort group j2 (data_) : drop if _n < _N | inlist(j2, "A0", "B0")
// keep necessary variables
keep group j2 data_
// second reshape
reshape wide data_, i(group) j(j2) string
list
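The role of fillin may not be obvious, so here is a reduced sketch on groups 1 and 3 only. As far as I can tell, its main job is to keep a group whose values are all empty: such a group only has the placeholder keys A0 and B0, which are dropped afterwards, so without fillin the group would disappear before the second reshape.

```stata
clear
input group obs str3(data_A data_B)
1 1 "7_a" ""
1 2 ""    "4_b"
1 3 "1_a" ""
3 1 ""    ""
end

reshape long data_, i(group obs) j(j) string
bysort group j (obs) : generate count = sum(!missing(data_))
generate j2 = j + string(count)

* before fillin, group 3 only has the placeholder keys A0 and B0
tabulate group j2

fillin group j2

* empty strings sort first, so the last row per key holds the value if any
bysort group j2 (data_) : drop if _n < _N | inlist(j2, "A0", "B0")

* group 3 now has one empty row per real key and survives the reshape
list group j2 data_, sepby(group)
```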
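I find the loop solution more intuitive.
The data structure you want is odd. It is always a good idea to add some context and your end goal.
I agree with Roberto that this is an odd thing to want. Here is another fun way to get there: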
clear
input float(group obs) str3(data data_A data_B)
1 1 "7_a" "7_a" ""
1 2 "4_b" "" "4_b"
1 3 "1_a" "1_a" ""
2 1 "5_b" "" "5_b"
3 1 "" "" ""
4 1 "3_b" "" "3_b"
4 2 "4_b" "" "4_b"
4 3 "9_a" "9_a" ""
4 4 "8_b" "" "8_b"
end
* verify assumptions about the data
isid group obs, sort
* concatenate values across obs
by group (obs): replace data_A = data_A[_n-1] + " " + data_A
by group (obs): replace data_B = data_B[_n-1] + " " + data_B
* the last obs of the group contains all values
by group: keep if _n == _N
* split each concatenated string
split data_A
split data_B
drop obs data data_A data_B
list
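One caveat with this approach: split parses on spaces by default, so it assumes the values themselves contain no spaces. If that cannot be guaranteed, a variant with an explicit delimiter might look like the sketch below. The "|" delimiter and the reduced data are my own choices for illustration, and the delimiter must not occur in the values.

```stata
clear
input group obs str3(data_A data_B)
1 1 "7_a" ""
1 2 ""    "4_b"
1 3 "1_a" ""
3 1 ""    ""
4 1 ""    "3_b"
4 2 ""    "4_b"
end
isid group obs, sort

* accumulate values within group, adding "|" only between nonempty pieces
foreach v of varlist data_A data_B {
    by group (obs): replace `v' = `v'[_n-1] + ///
        cond(`v'[_n-1] != "" & `v' != "", "|", "") + `v'
}

* the last observation of each group holds the full list
by group: keep if _n == _N

* split on the explicit delimiter instead of on spaces
split data_A, parse("|")
split data_B, parse("|")
drop obs data_A data_B
list
```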