Sum the number of firms in one year

Question

Suppose I have a data frame as follows:

dt=structure(list(id = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 5L, 5L, 6L, 6L), year = c(2001L, 2002L, 2003L, 2004L, 
2002L, 2002L, 2003L, 2004L, 2002L, 2003L, 2004L, 2005L, 2001L, 
2002L, 2001L, 2002L)), .Names = c("firm", "year"), row.names = c(NA, 
-16L), class = "data.frame")

dt
 firm year
1   1 2001
2   1 2002
3   1 2003
4   1 2004
5   2 2002
6   3 2002
7   3 2003
8   3 2004
9   4 2002
10  4 2003
11  4 2004
12  4 2005
13  5 2001
14  5 2002
15  6 2001
16  6 2002

Now I would like to summarise the number of firms that exit the market in each year. For example, I want a table like this:

 resulttable
     All 2001 2002 2003 2004 2005
2001   3    0    2    0    1    0
2002   3    0    1    0    1    1

The first row of the result table shows that 3 firms entered the market in 2001, of which 2 exited in 2002 and 1 exited in 2004. Thanks!

Answer 1

You can cross-tabulate the "enter" year against the "exit" year with table:

res <- table(
    dt$year[!duplicated(dt$firm)],
    factor(dt$year[!duplicated(dt$firm, fromLast = TRUE)], levels = unique(dt$year))
)
res <- as.data.frame.matrix(res)
res$All <- rowSums(res)

# > res
#      2001 2002 2003 2004 2005 All
# 2001    0    2    0    1    0   3
# 2002    0    1    0    1    1   3

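I assume dt is sorted as provided. If it is not, it must be sorted by year first, because the code takes the first and last row of each firm as its entry and exit year.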
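The sorting step can be sketched as follows. This is a minimal example, assuming the column names firm and year from the question; the names shuffled, sorted and years are only illustrative, and the data frame is rebuilt here so the snippet runs on its own.

```r
# Same data as in the question
dt <- data.frame(
  firm = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L),
  year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2002L, 2003L, 2004L,
           2002L, 2003L, 2004L, 2005L, 2001L, 2002L, 2001L, 2002L)
)

# Deliberately reverse the rows to simulate unsorted input
shuffled <- dt[rev(seq_len(nrow(dt))), ]

# Sort by firm, then year, so that the first row of each firm is its
# entry year and the last row is its exit year
sorted <- shuffled[order(shuffled$firm, shuffled$year), ]

# Sorted levels keep the exit columns in chronological order
years <- sort(unique(sorted$year))
res <- table(
  entry = sorted$year[!duplicated(sorted$firm)],
  exit  = factor(sorted$year[!duplicated(sorted$firm, fromLast = TRUE)],
                 levels = years)
)
res
```

Using sort(unique(...)) for the factor levels, rather than unique(...) alone, keeps the exit columns in chronological order even when the years do not first appear in ascending order.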

Here is the approach suggested by latemail in the comments, which gives this result:

addmargins(table(
    dt$year[!duplicated(dt$firm)],
    factor(dt$year[!duplicated(dt$firm, fromLast = TRUE)], levels = unique(dt$year))
), 2)

#      2001 2002 2003 2004 2005 Sum
# 2001    0    2    0    1    0   3
# 2002    0    1    0    1    1   3

Answer 2

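This is not a complete solution, because the result does not include the missing 'exited' years. Including them is possible, but takes a number of extra steps. Using two libraries, dplyr and tidyr, we can do the whole process.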

library(dplyr)
library(tidyr)
dt %>% 
  group_by(firm) %>% 
  summarise(entered=min(year),exited=max(year),count=1) %>% 
  group_by(entered,exited) %>% 
  summarise(count=sum(count)) %>%
  mutate(All = sum(count)) %>% 
  ungroup() %>% 
  spread(exited,count,fill=0)

# A tibble: 2 x 5
#   entered   All `2002` `2004` `2005`
# *   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
# 1    2001     3      2      1      0
# 2    2002     3      1      1      1

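group_by says we want to compute within each firm
summarise computes values once per group; here we get entered and exited, and create a counting variable count
Now we group by entered and exited (the order matters), so we are grouping on the intersection of the two years
We now sum our count variable over each combination of years. summarise here drops the rightmost grouping (exited)
mutate creates a new variable, All. In this case it looks like summarise, but instead of collapsing rows it computes the value within our groups and repeats it on every row
ungroup removes the remaining grouping
spread creates a column for each value of our key (exited), fills it from the specified value column (count), and fills missing combinations with 0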
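One way the missing 'exited' years could be added is sketched below. This is not the answer's own code: it swaps in tidyr::complete() and tidyr::pivot_wider() (instead of spread) and dplyr::count(), so it assumes tidyr 1.0.0 or later and a reasonably recent dplyr. The names all_years, spans and wide are only illustrative, and the data frame is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
dt <- data.frame(
  firm = c(1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L),
  year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2002L, 2003L, 2004L,
           2002L, 2003L, 2004L, 2005L, 2001L, 2002L, 2001L, 2002L)
)

# Every year that occurs in the data becomes a candidate exit year
all_years <- sort(unique(dt$year))

# One row per firm with its first and last observed year
spans <- dt %>%
  group_by(firm) %>%
  summarise(entered = min(year), exited = max(year))

wide <- spans %>%
  count(entered, exited, name = "count") %>%
  # add the (entered, exited) combinations that never occur, with count 0
  complete(entered, exited = all_years, fill = list(count = 0L)) %>%
  group_by(entered) %>%
  mutate(All = sum(count)) %>%
  ungroup() %>%
  pivot_wider(names_from = exited, values_from = count)

wide
```

complete() only expands the exit years here, so entry years without any entering firm (2003 to 2005) are still left out, which matches the table the question asks for.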

Answer 3

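Here is an option using dcast from data.table. Convert the 'data.frame' to a 'data.table' (setDT(dt)), get the range of 'year' grouped by 'firm' as two columns, dcast to 'wide' format with drop = FALSE so that unused levels are not dropped, and then add up the values in each row with Reduce.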

library(data.table)
dcast(setDT(dt)[, as.list(range(year)), firm], V1 ~ factor(V2, levels = unique(dt$year)), 
          drop =FALSE)[, All := Reduce(`+` , .SD), .SDcols = -1][]
#      V1 2001 2002 2003 2004 2005 All
#1: 2001    0    2    0    1    0   3
#2: 2002    0    1    0    1    1   3

Answer 4

Here is a slightly different approach using data.table, which computes the total All before reshaping from long to wide format:

library(data.table)
setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][, All := .N, by = entry][
  , dcast(.SD, entry + All ~ exit, length, value.var = "firm")]

   entry All 2002 2004 2005
1:  2001   3    2    1    0
2:  2002   3    1    1    1

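This already conveys all the essential results that the OP described verbally in the question.

However, the OP's expected result includes columns for the years 2001 and 2003, although they contain only 0. If years without any entry or exit need to be shown, this can be achieved by completing the missing years before computing the total All and reshaping: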

setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
  CJ(entry = dt$year, exit = dt$year, unique = TRUE), on = .(entry, exit)][
    , All := sum(!is.na(firm)), by = entry][][
      , dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]

   entry All 2001 2002 2003 2004 2005
1:  2001   3    0    2    0    1    0
2:  2002   3    0    1    0    1    1
3:  2003   0    0    0    0    0    0
4:  2004   0    0    0    0    0    0
5:  2005   0    0    0    0    0    0

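The missing years are completed by joining with a table of all available combinations of year, which is created by the cross join function CJ(). Completing introduces many NA values in firm, so length(firm) had to be replaced by sum(!is.na(firm)) as the aggregation function.

The extent of the resulting wide format can be controlled through the year ranges given in CJ(). For example, the empty entry years 2003 to 2005 can be dropped: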
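To see why the aggregation function has to change, here is a small sketch of what CJ() and the join do. The toy table firms and the names grid and joined are made up for illustration and are not part of the answer.

```r
library(data.table)

# CJ() builds every combination of its arguments; unique = TRUE removes
# duplicate values from each argument first
grid <- CJ(entry = c(2002L, 2001L, 2001L),
           exit  = c(2004L, 2002L, 2002L),
           unique = TRUE)
# grid has 2 x 2 = 4 rows

# Hypothetical toy data: one row per firm with its entry and exit year
firms <- data.table(firm  = 1:3,
                    entry = c(2001L, 2001L, 2002L),
                    exit  = c(2002L, 2004L, 2004L))

# The join keeps every row of grid; combinations without a matching
# firm get NA in the firm column
joined <- firms[grid, on = .(entry, exit)]

nrow(joined)              # counts the NA placeholder rows as well
sum(!is.na(joined$firm))  # counts only the real firms
```

The combination entry 2002 / exit 2002 has no firm in the toy data, so it appears in joined with firm set to NA. Counting rows with length() would treat that placeholder as a firm, which is why the non-NA count is used.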

setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
  CJ(entry = min(entry):max(entry), exit = dt$year, unique = TRUE), on = .(entry, exit)][
    , All := sum(!is.na(firm)), by = entry][][
      , dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]

   entry All 2001 2002 2003 2004 2005
1:  2001   3    0    2    0    1    0
2:  2002   3    0    1    0    1    1

This exactly reproduces the OP's expected resulttable.

Or, the empty exit year 2001 can be dropped as well:

setDT(dt)[, .(entry = min(year), exit = max(year)), by = firm][
  CJ(entry = min(entry):max(entry), exit = min(exit):max(exit)), on = .(entry, exit)][
    , All := sum(!is.na(firm)), by = entry][][
      , dcast(.SD, entry + All ~ exit, function(x) (sum(!is.na(x))), value.var = "firm")]

   entry All 2002 2003 2004 2005
1:  2001   3    2    0    1    0
2:  2002   3    1    0    1    1
