How to count distinct string cases by group and include zero in Stata?

I have a dataset with the following structure:

clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

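I want to count how many distinct documents there are in each year-state group, including groups with zero documents. In other words, I want my output to look like this: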

  +----------------------------+
  | year   state   n_documents |
  |----------------------------|
  | 2006      AK             2 |
  | 2007      AK             2 |
  | 2008      AK             1 |
  | 2009      AK             0 |
  | 2006      AL             1 |
  | 2007      AL             0 |
  | 2008      AL             3 |
  | 2009      AL             2 |
  | 2006      AS             3 |
  | 2007      AS             1 |
  | 2008      AS             1 |
  | 2009      AS             3 |
  +----------------------------+

I tried to solve this using the tag() function of the egen command:

egen tag = tag(year state document)
egen n_documents = total(tag), by(year state)

collapse (first) n_documents, by(year state)

sort state year
list, sep(0) abb(20)

However, I end up with the following dataset (without the zeros):


  +----------------------------+
  | year   state   n_documents |
  |----------------------------|
  | 2006      AK             2 |
  | 2007      AK             2 |
  | 2008      AK             1 |
  | 2006      AL             1 |
  | 2008      AL             3 |
  | 2009      AL             2 |
  | 2006      AS             3 |
  | 2007      AS             1 |
  | 2008      AS             1 |
  | 2009      AS             3 |
  +----------------------------+

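Of course, I could manually add the remaining year-state combinations that have no documents, but my real dataset has almost a million observations, so a manual solution is impractical here. Is there a way to do this in Stata?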

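Thanks for providing a data example and a clear description. One trick is to reshape wide and then back to long, which creates a row for every state-year pair, and then replace the missing values with 0.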

clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

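* flag one observation per distinct year-state-document combination
egen tag = tag(year state document)
* summing the flags gives the number of distinct documents per group
collapse (sum) n_documents=tag, by(state year)

* going wide and back to long creates a row for every state-year pair
reshape wide n_documents, i(state) j(year)
reshape long
* pairs that were absent come back as missing; recode them to 0
mvencode n_documents, mv(0)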
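
As a variant on the same idea (not part of the answer above), the absent state-year pairs can also be added with fillin, which rectangularizes the data and marks the rows it creates in the variable _fillin. This is only a minimal sketch using the question's data, and it assumes that a pair with no documents should get a count of 0:

```stata
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

* count distinct documents per observed state-year pair, as above
egen tag = tag(year state document)
collapse (sum) n_documents=tag, by(state year)

* add the state-year pairs that never occur; new rows have _fillin == 1
fillin state year
* the added rows have a missing count, which here means no documents
replace n_documents = 0 if _fillin
drop _fillin

sort state year
list, sepby(state)
```

Like the reshape round trip, this only fills in pairs built from states and years that each appear somewhere in the data.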

Here is another approach.

clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

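* zero adds every year-state-document combination, with frequency 0 if unobserved
contract year state document, zero freq(distinct)
* turn frequencies into a 0/1 indicator so repeated documents count once
replace distinct = distinct > 0
* sum the indicators within each state-year group
collapse (sum) distinct, by(state year)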

list , sepby(state)

     +-------------------------+
     | year   state   distinct |
     |-------------------------|
  1. | 2006      AK          2 |
  2. | 2007      AK          2 |
  3. | 2008      AK          1 |
  4. | 2009      AK          0 |
     |-------------------------|
  5. | 2006      AL          1 |
  6. | 2007      AL          0 |
  7. | 2008      AL          3 |
  8. | 2009      AL          2 |
     |-------------------------|
  9. | 2006      AS          3 |
 10. | 2007      AS          1 |
 11. | 2008      AS          1 |
 12. | 2009      AS          3 |
     +-------------------------+

Edit: @Romalpa Akzo pointed out this more direct solution:

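* keep one row per distinct state-year-document combination
contract state year document, nomiss
* count those rows per state-year pair; zero keeps the pairs with no rows
contract state year, freq(n_document) zero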
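
For a dataset too large to inspect by eye, a quick sanity check on the result may be useful. The sketch below reruns the two contract calls on the question's data and then checks that every state-year pair appears exactly once. The checks (isid, and the row count compared with states times years) are my own additions for illustration, not part of the suggested solution:

```stata
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

contract state year document, nomiss
contract state year, freq(n_document) zero

* state and year together should uniquely identify each row
isid state year

* the number of rows should equal (# of states) x (# of years)
quietly tabulate state
local n_states = r(r)
quietly tabulate year
local n_years = r(r)
assert _N == `n_states' * `n_years'

list, sepby(state)
```

If either check fails, Stata stops with an error, so the same lines can be left in a do-file that processes the full data.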