Count factor levels over time

I have a data.frame that looks like this:

head(entries,10)

     Provider.Region      year.start    month.start day.start  Provider.Status
23511      North West       0010          05        17 Deregistered (V)
23512      North West       0010          05        17 Deregistered (V)
23709   West Midlands       0010          06        01       Registered
23562          London       0010          06        10       Registered
23563          London       0010          06        10       Registered
23566          London       0010          06        10       Registered
23764   West Midlands       0010          06        10 Deregistered (V)
23508          London       0010          06        11 Deregistered (V)
23555   West Midlands       0010          06        11       Registered
23497      South East       0010          06        14 Deregistered (V)

I would like to count the factor levels of Provider.Status per region and month. My desired output should look like this:

head(entries.1, 3)

time    region        Deregistered (V) Registered 
5-0010  North West        2              0
6-0010  West Midlands     2              1
6-0010  London            1              3

So far I have been using dplyr as follows:

library(dplyr)
entries %>%
  group_by(Provider.Region, year.start, month.start) %>%
  mutate(counts_status = n())  

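But this still does not produce the output I expect. mutate(n()) attaches the total group size to every row instead of counting each status separately, so it gives the following: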

Source: local data frame [23,775 x 6]
Groups: Provider.Region, year.start, month.start [606]

Provider.Region year.start month.start  Provider.Status counts_status
(fctr)     (fctr)      (fctr)              (fctr)         (int)
1       North West       0010          05 Deregistered (V)      2
2       North West       0010          05 Deregistered (V)      2
3    West Midlands       0010          06 Registered            4
4           London       0010          06 Registered            7
5           London       0010          06 Registered            7
6           London       0010          06 Registered            7
7    West Midlands       0010          06 Deregistered (V)      4
8           London       0010          06 Deregistered (V)      7
9    West Midlands       0010          06 Registered            4
10      South East       0010          06 Deregistered (V)      10
..             ...        ...         ...       ...              ...

Is there a compact way to create one variable per status level from the counts? Many thanks.

You can use the reshape2 package to produce a table like this:

library(reshape2)
d <- data.frame(region=rep(c("A", "B", "C"), each=2), timepoint = c(1, 1, 1, 1, 2, 2), provider=rep(c("D", "R"), 3), count_status = 1:6)
dcast(d, region + timepoint ~ provider, value.var = "count_status")

This gives the following output:

  region timepoint D R
1      A         1 1 2
2      B         1 3 4
3      C         2 5 6
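The example above starts from a table that already has one count per region, timepoint and provider. As a minimal sketch of how to get there from raw rows, the counts could first be computed with dplyr's count() and then passed to dcast. The data frame `raw` and its values are made up for illustration and are not the asker's data.

```r
library(dplyr)
library(reshape2)

# Hypothetical raw data: one row per entry (illustrative values only)
raw <- data.frame(
  region    = c("A", "A", "A", "B", "B", "C"),
  timepoint = c(1, 1, 1, 1, 1, 2),
  provider  = c("D", "R", "R", "D", "D", "R")
)

# count() collapses each region/timepoint/provider group
# into a single row holding the group size
long <- raw %>%
  count(region, timepoint, provider, name = "count_status")

# Spread the provider levels into columns;
# fill = 0 covers combinations that never occur in the data
wide <- dcast(long, region + timepoint ~ provider,
              value.var = "count_status", fill = 0)
wide
```

Because every region/timepoint/provider combination appears at most once in `long`, dcast does not need an aggregation function here.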

This can be done with the dcast function from either the reshape2 or the data.table package:

library(reshape2)
dcast(mydf, paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)

library(data.table)
dcast(setDT(mydf), paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)

The data.table version gives:

   year.start Provider.Region Deregistered(V) Registered
1:    0010-05       NorthWest               2          0
2:    0010-06          London               1          3
3:    0010-06       SouthEast               1          0
4:    0010-06    WestMidlands               1          2

When you run the code above, you will get the following message:

Using 'Provider.Status' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'

This does not affect the result, but to suppress it you can specify value.var and the aggregation function explicitly:

dcast(setDT(mydf), 
      paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status,
      value.var = "Provider.Status", fun.aggregate = length)
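To try this end to end, here is a self-contained sketch that retypes the ten sample rows from the question by hand and runs the same kind of dcast call. Two choices here are assumptions of this sketch rather than part of the answer: the columns are stored as plain character vectors, and the time key is built as a separate `time` column first so that the formula contains only column names.

```r
library(data.table)

# The ten sample rows from the question, retyped by hand
# (row names and day.start dropped)
mydf <- data.frame(
  Provider.Region = c("North West", "North West", "West Midlands", "London",
                      "London", "London", "West Midlands", "London",
                      "West Midlands", "South East"),
  year.start      = "0010",
  month.start     = c("05", "05", rep("06", 8)),
  Provider.Status = c("Deregistered (V)", "Deregistered (V)", "Registered",
                      "Registered", "Registered", "Registered",
                      "Deregistered (V)", "Deregistered (V)", "Registered",
                      "Deregistered (V)")
)

# as.data.table() makes a copy, so mydf itself is left unchanged
dt <- as.data.table(mydf)

# Build the year-month key as its own column
dt[, time := paste(year.start, month.start, sep = "-")]

# One row per time/region, one column per status level, cells are row counts
res <- dcast(dt, time + Provider.Region ~ Provider.Status,
             value.var = "Provider.Status", fun.aggregate = length)
res
```

Since fun.aggregate = length counts rows, combinations that never occur (for example Registered in North West) come out as 0 rather than NA.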