How to count distinct string cases by group and include zero in Stata?
I have a dataset with the following structure:
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end
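I want to count how many distinct documents there are in each year-state group, including zeros. In other words, I would like my output to look like this: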
     +----------------------------+
     | year   state   n_documents |
     |----------------------------|
     | 2006      AK             2 |
     | 2007      AK             2 |
     | 2008      AK             1 |
     | 2009      AK             0 |
     | 2006      AL             1 |
     | 2007      AL             0 |
     | 2008      AL             3 |
     | 2009      AL             2 |
     | 2006      AS             3 |
     | 2007      AS             1 |
     | 2008      AS             1 |
     | 2009      AS             3 |
     +----------------------------+
I tried to solve this with the tag() function of the egen command:
egen tag = tag(year state document)
egen n_documents = total(tag), by(year state)
collapse (first) n_documents, by(year state)
sort state year
list, sep(0) abb(20)
However, I end up with the following dataset (without the zeros):
     +----------------------------+
     | year   state   n_documents |
     |----------------------------|
     | 2006      AK             2 |
     | 2007      AK             2 |
     | 2008      AK             1 |
     | 2006      AL             1 |
     | 2008      AL             3 |
     | 2009      AL             2 |
     | 2006      AS             3 |
     | 2007      AS             1 |
     | 2008      AS             1 |
     | 2009      AS             3 |
     +----------------------------+
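Of course, I could manually add the remaining year-state combinations that have no documents, but my real dataset has almost one million observations, so a manual solution is impractical here. Is there a way to do this in Stata?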
Thanks for providing a data example and a clear description. One trick is to reshape wide and back to long, and then replace the missing values with 0.
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end
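* Flag exactly one observation per distinct year-state-document combination
egen tag = tag(year state document)
* Summing the flags gives the number of distinct documents per state-year
collapse (sum) n_documents=tag, by(state year)
* Going wide and back to long creates a row for every state-year pair
reshape wide n_documents, i(state) j(year)
reshape long
* The newly created pairs hold missing values; recode them to 0
mvencode n_documents, mv(0)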
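The reshape round trip works because it makes the collapsed data rectangular in state and year. As a hedged alternative sketch (not part of the answer above), the same rectangularization can be done with the built-in fillin command, which adds the missing state-year combinations and marks them in a variable named _fillin. The sketch reuses the question's example data.

```stata
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

* Flag one observation per distinct year-state-document combination
egen tag = tag(year state document)

* One row per observed state-year pair, holding the distinct-document count
collapse (sum) n_documents = tag, by(state year)

* Add the state-year pairs that never occur; fillin sets _fillin to 1 for them
fillin state year
replace n_documents = 0 if _fillin
drop _fillin

sort state year
list, sepby(state)
```

Like the reshape approach, this only creates combinations of states and years that each appear somewhere in the data. A year that is absent for every state would still be missing from the result.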
Here is another approach.
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end
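* One row per year-state-document combination; zero adds unobserved ones with frequency 0
contract year state document, zero freq(distinct)
* Turn frequencies into a 0/1 indicator: does the document occur in this state-year?
replace distinct = distinct > 0
* Summing the indicators counts the distinct documents per state-year
collapse (sum) distinct, by(state year)
list , sepby(state)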
     +-------------------------+
     | year   state   distinct |
     |-------------------------|
  1. | 2006      AK          2 |
  2. | 2007      AK          2 |
  3. | 2008      AK          1 |
  4. | 2009      AK          0 |
     |-------------------------|
  5. | 2006      AL          1 |
  6. | 2007      AL          0 |
  7. | 2008      AL          3 |
  8. | 2009      AL          2 |
     |-------------------------|
  9. | 2006      AS          3 |
 10. | 2007      AS          1 |
 11. | 2008      AS          1 |
 12. | 2009      AS          3 |
     +-------------------------+
Edit: @Romalpa Akzo pointed out this more direct solution:
contract state year document, nomiss
contract state year, freq(n_document) zero
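For readers who want to see what each of the two contract calls does, here is an annotated sketch run on the question's example data. The only change from the two lines above is an assumption on my part: the frequency variable is named n_documents (plural) so that it matches the output requested in the question. The assert lines are a sanity check against that requested output.

```stata
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

* Step 1: keep one row per distinct state-year-document combination.
* nomiss drops observations with a missing value in any of the three variables.
contract state year document, nomiss

* Step 2: count the remaining rows per state-year pair, i.e. the distinct documents.
* zero adds the state-year pairs that do not occur, with a frequency of 0.
contract state year, freq(n_documents) zero

* Sanity checks against the output requested in the question
assert _N == 12
assert n_documents == 0 if state == "AK" & year == 2009
assert n_documents == 3 if state == "AL" & year == 2008

list, sepby(state)
```

Compared with calling contract with zero on all three variables, this version never builds the full year-state-document grid, which may matter with around a million observations and many distinct documents.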