Efficient Way to Categorize
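I have a string variable, call it desc, which has many distinct values, say 300. I want to create two new variables, desc_a and desc_b. The values of desc fall into two classes; I want to put those belonging to the first class in desc_a and the rest in desc_b. I will describe one approach I came up with. However, it is very slow, and I wonder whether there is a better way to do this.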
gen desc_a = ""
gen desc_b = ""
tab desc
The resulting tab output might look like this (irrelevant information omitted):
DESC                 |  Freq.  Perc.  Cum.
___________________________________________
First Element of a       53
Second Element of a      22
First Element of b       78
Third Element of a      232
Second Element of b      33
* Manually go through the tab output and copy and paste each string into a statement, for example:
replace desc_a = "First Element of a" if desc=="First Element of a"
replace desc_a = "Second Element of a" if desc=="Second Element of a"
replace desc_a = "Third Element of a" if desc=="Third Element of a"
...
replace desc_b = "First Element of b" if desc=="First Element of b"
replace desc_b = "Second Element of b" if desc=="Second Element of b"
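Note that the real data do not follow such a nice pattern, so I cannot automate this with regular expressions or anything similar. I really do need to inspect each value by hand and decide which class it belongs to. Still, I suspect that the approach I described, with all its copying and pasting, is not the best one.
This is not the best either, but it is an improvement on my solution above: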
gen desc_a = ""
gen desc_b = ""
replace desc_a = desc if desc=="First Element of a"
replace desc_a = desc if desc=="Second Element of a"
replace desc_a = desc if desc=="Third Element of a"
...
replace desc_b = desc if desc_a==""
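The repeated statements can also be collapsed with inlist(). This is a minimal sketch on the toy values from the question, not a definitive solution. Note that inlist() accepts at most 10 string arguments, so a real list of 300 values would still need several such lines.

```stata
* toy data: the five example values from the question
clear
input str19 desc
"First Element of a"
"Second Element of a"
"First Element of b"
"Third Element of a"
"Second Element of b"
end

* one statement per batch of class-a values instead of one per value
gen desc_a = desc if inlist(desc, "First Element of a", "Second Element of a", "Third Element of a")

* everything not assigned to class a goes to class b
gen desc_b = desc if desc_a == ""
```

The strings still have to be chosen by hand, but each value is typed once rather than twice.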
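The Stata Data Editor window will help reduce your workload.
Create a Stata dataset with two variables: the 300 distinct values of desc, and a variable, which I'll call ab, initialized to missing. Then open the dataset in the Stata Data Editor and go through the observations, replacing the missing values (by typing into the cells) with an indicator of whether the description belongs to group a or b (say, 1 or 2). Then save that dataset, merge it with your original dataset, and use the merged values of ab to assign the descriptions to the appropriate variables.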
generate desc_a = desc if ab==1
generate desc_b = desc if ab==2
Expanding on @William's solution:
* recreate your data example
clear
input str19 desc int n
"First Element of a" 53
"Second Element of a" 22
"First Element of b " 78
"Third Element of a" 232
"Second Element of b" 33
end
expand n
set seed 314324
gen somedata = runiform()
sort somedata
tab desc
tempfile main
save "`main'"
* reduce to one observation per value of desc
bysort desc: keep if _n == 1
keep desc
* make an effort to identify a or b, note that
* the following fails for one obs
gen ab = regexs(1) if regexm(desc,"(a|b)$")
* save and edit manually
tempfile toedit
save "`toedit'"
* this is simulated editing...
clear
input str19 desc str1 ab
"First Element of a" "a"
"First Element of b " "b"
"Second Element of a" "a"
"Second Element of b" "b"
"Third Element of a" "a"
end
* now combine with the original data
merge 1:m desc using "`main'", assert(match) nogen
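To finish what the question asks for, the merged indicator can be used to fill the two target variables. This is a minimal sketch that continues from the merge above. It assumes the merged data are still in memory and that ab is the str1 variable holding "a" or "b", as in the simulated edit.

```stata
* split desc into the two target variables using the hand-edited indicator
gen desc_a = desc if ab == "a"
gen desc_b = desc if ab == "b"

* sanity check: every observation lands in exactly one of the two variables
assert (desc_a != "") + (desc_b != "") == 1
```

If the indicator were coded numerically (1 or 2) as in the previous answer, the conditions would be ab == 1 and ab == 2 instead.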