Stata query: Need help creating a new variable dependent on data from a different row within same household
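I would like to create a new column in my cross-sectional survey dataset that holds the education level of a woman's husband. I have household (hid) and individual (HL1) IDs, along with the following information:
- MA1 == whether the woman is currently married (observed for women only)
- MA2 == age of the husband (observed for married women only)
- HL4 == sex (observed for all individuals)
- HL6 == age (observed for all individuals)
- ED4A == highest level of education (observed for all individuals)
Essentially, I would like code that does the following:
- first check whether the woman is married (MA1)
- if so, look at her husband's age (MA2)
- then match the husband's age (MA2) to the age (HL6) of a male in the household
- then look up that male's education (ED4A) and put it in the new column, in the same row as the woman.
I tried this, but it did not work: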
bysort hid (HL6) : gen husb_educ = ED4A[MA2]
Here is a sample from the dataset:
+-----+----------+-----+-----+--------+-----+----------+
| HL1 | MA1 | MA2 | hid | HL4 | HL6 | ED4A |
+-----+----------+-----+-----+--------+-----+----------+
| 1 | | | 106 | Male | 57 | Diploma |
| 2 | | | 106 | Female | 53 | Intermed |
| 3 | | | 106 | Male | 30 | Higher S |
| 4 | No, not | | 106 | Female | 24 | Bachelor |
| 5 | | | 106 | Male | 22 | Diploma |
| 6 | | | 106 | Male | 17 | Secondar |
| 7 | | | 106 | Female | 10 | Primary |
| 8 | Yes, cur | 22 | 106 | Female | 23 | Diploma |
| 9 | | | 106 | Female | 0 | |
+-----+----------+-----+-----+--------+-----+----------+
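So in this example, I would like a new column holding the husband's education, and row 8 should show Diploma in that column (because the woman's husband is 22, and the 22-year-old male in the household has a Diploma).
The same sample, without value labels: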
+-----+-----+-----+-----+-----+-----+------+
| HL1 | MA1 | MA2 | hid | HL4 | HL6 | ED4A |
+-----+-----+-----+-----+-----+-----+------+
| 1 | | | 106 | 1 | 57 | 4 |
| 2 | | | 106 | 2 | 53 | 2 |
| 3 | | | 106 | 1 | 30 | 6 |
| 4 | 3 | | 106 | 2 | 24 | 5 |
| 5 | | | 106 | 1 | 22 | 4 |
| 6 | | | 106 | 1 | 17 | 3 |
| 7 | | | 106 | 2 | 10 | 1 |
| 8 | 1 | 22 | 106 | 2 | 23 | 4 |
| 9 | | | 106 | 2 | 0 | |
+-----+-----+-----+-----+-----+-----+------+
An especially large household:
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1 . . 365809 1 33 1
2 1 33 365809 2 26 1
1 . . 365810 1 58 1
2 . . 365810 2 54 .
3 . . 365810 1 23 3
4 . . 365810 1 23 2
5 . . 365810 1 18 3
6 . . 365810 1 15 2
7 . . 365810 2 12 2
8 . . 365810 1 33 3
9 1 dk 365810 2 31 1
10 . . 365810 2 13 2
11 . . 365810 2 11 1
12 . . 365810 1 9 1
13 . . 365810 1 6 1
14 . . 365810 2 3 .
15 . . 365810 1 2 .
16 . . 365810 1 33 3
17 1 33 365810 2 30 1
18 . . 365810 1 8 1
19 . . 365810 2 6 1
20 . . 365810 2 5 .
21 . . 365810 1 1 .
22 . . 365810 1 32 4
23 1 32 365810 2 30 1
24 . . 365810 1 5 .
25 . . 365810 2 3 .
26 . . 365810 1 2 .
27 . . 365810 1 30 4
28 1 30 365810 2 28 1
29 . . 365810 2 2 .
30 . . 365810 1 0 .
31 . . 365810 1 27 2
32 1 27 365810 2 27 1
33 . . 365810 2 2 .
34 . . 365810 2 0 .
end
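Since you have already outlined the steps needed to do what you want, writing a simple script for it should not be a problem.
In my experience, it is easier to learn the syntax if you write and run each step separately (and check what happens after each one, whether any errors were introduced, etc.). Once you have the hang of it, you can condense the code. Something like this should work (trying to follow the steps in your question):
* look at whether the wife is currently married
* not strictly necessary, as only married women have MA2, but the next step takes only married women into account
* codes taken from the unlabelled sample: MA1 == 1 is currently married, HL4 == 2 is female, HL4 == 1 is male
* generate husband's age variable and spread it to the whole household (new var to keep original MA2 untouched)
gen husband_age = MA2 if MA1 == 1 & HL4 == 2
bys hid: egen husband_age_hid = max(husband_age)
* mark which individual is the husband (assumed this is what was meant by pairing age of husband with age of male in household)
gen husband = 0
bys hid: replace husband = 1 if husband_age_hid == HL6 & HL4 == 1
* copy the husband's education information to the whole household
gen husband_ED4 = ED4A if husband == 1
bys hid: egen husb_educ = max(husband_ED4)
* data cleaning, if necessary (husb_educ does not match the husband* pattern, so it is kept)
drop husband*
Using tempvars in the first steps instead of generating new variables might be better, but I thought these variables could be useful later.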
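As a rough illustration of the tempvar idea mentioned above, the same steps might be written as below. This is only a sketch: it assumes the codes from the unlabelled sample (MA1 == 1 for currently married, HL4 == 2 for female, HL4 == 1 for male), and `husb_educ2` is a hypothetical name chosen so that it does not clash with `husb_educ`. Like the code above, it assumes at most one married woman per household and spreads the result to every row of the household.

```stata
* Example data for household 106, copied from the question
clear
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1 . . 106 1 57 4
2 . . 106 2 53 2
3 . . 106 1 30 6
4 3 . 106 2 24 5
5 . . 106 1 22 4
6 . . 106 1 17 3
7 . . 106 2 10 1
8 1 22 106 2 23 4
9 . . 106 2 0 .
end

* Temporary variables: dropped automatically when the do-file finishes
tempvar hage hage_hid hed

* Husband's age as reported by the married woman (assumed codes, see above)
gen `hage' = MA2 if MA1 == 1 & HL4 == 2

* Spread that age to every member of the household
bysort hid : egen `hage_hid' = max(`hage')

* Education of the male whose age matches the reported husband's age
gen `hed' = ED4A if HL4 == 1 & HL6 == `hage_hid'

* Spread the husband's education to the whole household
bysort hid : egen husb_educ2 = max(`hed')
```

Run from a do-file, only `husb_educ2` remains afterwards, so no `drop` step is needed.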
This is a start. The code does loop over the different married women in each household, but it does nothing if two or more men match the husband's age.
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1 . . 106 1 57 4
2 . . 106 2 53 2
3 . . 106 1 30 6
4 3 . 106 2 24 5
5 . . 106 1 22 4
6 . . 106 1 17 3
7 . . 106 2 10 1
8 1 22 106 2 23 4
9 . . 106 2 0 .
end
bysort hid (MA1) : gen wid = _n if MA1 == 1
su wid, meanonly
local max = r(max)
gen heducation = .
quietly forval i = 1/`max' {
bysort hid : egen hage = min(cond(wid == `i', MA2, .))
by hid : egen nmatches = total(HL4 == 1 & HL6 == hage)
by hid : egen work = min(cond(nmatches == 1 & HL6 == hage, ED4, .))
replace heducation = work if wid == `i'
drop hage nmatches work
}
sort hid HL1
list
+-----------------------------------------------------------+
| HL1 MA1 MA2 hid HL4 HL6 ED4A wid heduca~n |
|-----------------------------------------------------------|
1. | 1 . . 106 1 57 4 . . |
2. | 2 . . 106 2 53 2 . . |
3. | 3 . . 106 1 30 6 . . |
4. | 4 3 . 106 2 24 5 . . |
5. | 5 . . 106 1 22 4 . . |
|-----------------------------------------------------------|
6. | 6 . . 106 1 17 3 . . |
7. | 7 . . 106 2 10 1 . . |
8. | 8 1 22 106 2 23 4 1 4 |
9. | 9 . . 106 2 0 . . . |
+-----------------------------------------------------------+
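To see how often the caveat about two or more men sharing the husband's age actually bites, one could count men of the same age within each household. The following is a minimal sketch on a few rows taken from the larger example in the question; `nsame` is a hypothetical variable name, and HL4 == 1 is assumed to code males.

```stata
* A few rows copied from the question's larger example (households 365809 and 365810)
clear
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1  . .  365809 1 33 1
2  1 33 365809 2 26 1
8  . .  365810 1 33 3
16 . .  365810 1 33 3
17 1 33 365810 2 30 1
end

* nsame: number of men in the household who share this row's age
bysort hid HL6 : egen nsame = total(HL4 == 1)

* Men whose age is not unique within the household: a wife reporting
* this age as MA2 cannot be matched unambiguously
list hid HL1 HL6 ED4A if HL4 == 1 & nsame > 1, sepby(hid)
```

In household 365810 the two men aged 33 are listed, which is why the woman reporting a husband aged 33 there gets no education value from the loop.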
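(UPDATE)
The extended example exposed a bug: one of the calculations was not strict enough and did not exclude women of the same age. (Incidentally, note that the new data are for two households, not one.)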
bysort hid (MA1) : gen wid = _n if MA1 == 1
su wid, meanonly
local max = r(max)
gen heducation = .
quietly forval i = 1/`max' {
bysort hid : egen hage`i' = min(cond(wid == `i', MA2, .))
by hid : egen nmatches`i' = total(HL4 == 1 & HL6 == hage`i')
by hid : egen work`i' = min(cond(nmatches`i' == 1 & HL6 == hage`i' & HL4 == 1, ED4, .))
replace heducation = work`i' if wid == `i'
}
sort hid wid HL1
list hid wid MA2 HL6 ED4 heducation HL4 if inlist(HL6, 27, 30, 32, 33) | MA2 < ., sepby(hid)
+--------------------------------------------------+
| hid wid MA2 HL6 ED4A heduca~n HL4 |
|--------------------------------------------------|
1. | 365809 1 33 26 1 1 2 |
2. | 365809 . . 33 1 . 1 |
|--------------------------------------------------|
3. | 365810 1 27 27 1 2 2 |
4. | 365810 2 33 30 1 . 2 |
5. | 365810 3 32 30 1 4 2 |
6. | 365810 4 30 28 1 4 2 |
14. | 365810 . . 33 3 . 1 |
21. | 365810 . . 33 3 . 1 |
26. | 365810 . . 32 4 . 1 |
30. | 365810 . . 30 4 . 1 |
33. | 365810 . . 27 2 . 1 |
+--------------------------------------------------+
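A different route, not the one used in the code above, would be to avoid the loop and instead merge a lookup of men onto the women by household and age. The sketch below assumes the same codes as before (MA1 == 1 currently married, HL4 == 1 male); `heducation2` and the tempfile name `men` are hypothetical. As in the loop, an age shared by two or more men in a household is treated as ambiguous and yields missing.

```stata
* Example data for household 106, copied from the question
clear
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1 . . 106 1 57 4
2 . . 106 2 53 2
3 . . 106 1 30 6
4 3 . 106 2 24 5
5 . . 106 1 22 4
6 . . 106 1 17 3
7 . . 106 2 10 1
8 1 22 106 2 23 4
9 . . 106 2 0 .
end

* Build a lookup of men whose age is unique within their household
preserve
keep if HL4 == 1 & !missing(HL6)
bysort hid HL6 : keep if _N == 1
keep hid HL6 ED4A
* Rename so the man's age lines up with the wife's report of it
rename (HL6 ED4A) (MA2 heducation2)
tempfile men
save `men'
restore

* Attach the matching man's education to each woman's row
merge m:1 hid MA2 using `men', keep(master match) nogenerate

* Keep the value only for currently married women
replace heducation2 = . if MA1 != 1

sort hid HL1
list hid HL1 MA1 MA2 HL6 ED4A heducation2, sepby(hid)
```

Rows with missing MA2 find no partner in the lookup because men with missing age were dropped before saving, so they simply keep a missing `heducation2`.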
For a more general discussion, see