Count factor levels over time
I have a data.frame that looks like this:
head(entries,10)
      Provider.Region year.start month.start day.start  Provider.Status
23511      North West       0010          05        17 Deregistered (V)
23512      North West       0010          05        17 Deregistered (V)
23709   West Midlands       0010          06        01       Registered
23562          London       0010          06        10       Registered
23563          London       0010          06        10       Registered
23566          London       0010          06        10       Registered
23764   West Midlands       0010          06        10 Deregistered (V)
23508          London       0010          06        11 Deregistered (V)
23555   West Midlands       0010          06        11       Registered
23497      South East       0010          06        14 Deregistered (V)
I want to count the factor levels of Provider.Status by month. The output I am after should look like this:
head(entries.1, 3)
time region Deregistered (V) Registered
5-0010 North West 2 0
6-0010 West Midlands 2 1
6-0010 London 1 3
So far I have been using dplyr as follows:
library(dplyr)
entries %>%
group_by(Provider.Region, year.start, month.start) %>%
mutate(counts_status = n())
But this still does not produce the output I expect; it gives the following instead:
Source: local data frame [23,775 x 6]
Groups: Provider.Region, year.start, month.start [606]
   Provider.Region year.start month.start  Provider.Status counts_status
            (fctr)     (fctr)      (fctr)           (fctr)         (int)
1       North West       0010          05 Deregistered (V)             2
2       North West       0010          05 Deregistered (V)             2
3    West Midlands       0010          06       Registered             4
4           London       0010          06       Registered             7
5           London       0010          06       Registered             7
6           London       0010          06       Registered             7
7    West Midlands       0010          06 Deregistered (V)             4
8           London       0010          06 Deregistered (V)             7
9    West Midlands       0010          06       Registered             4
10      South East       0010          06 Deregistered (V)            10
..             ...        ...         ...              ...           ...
Is there a compact way to create variables from the counts? Many thanks.
You can use the reshape2 package to produce a table like this:
library(reshape2)
d <- data.frame(region=rep(c("A", "B", "C"), each=2), timepoint = c(1, 1, 1, 1, 2, 2), provider=rep(c("D", "R"), 3), count_status = 1:6)
dcast(d, region + timepoint ~ provider, value.var = "count_status")
which gives this output:
region timepoint D R
1 A 1 1 2
2 B 1 3 4
3 C 2 5 6
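The example above starts from a count_status column that already exists. As a rough sketch of how to get there from raw rows, assuming dplyr is installed alongside reshape2, count() can build the long-format counts first. The data frame `raw` and its values are made up for illustration and are not the asker's data.

```r
library(dplyr)
library(reshape2)

# Hypothetical raw data: one row per record, no counts yet
raw <- data.frame(
  region    = c("A", "A", "A", "B", "B", "C"),
  timepoint = c(1, 1, 1, 1, 1, 2),
  provider  = c("D", "D", "R", "R", "R", "D")
)

# Long-format counts: one row per region/timepoint/provider combination
counts <- count(raw, region, timepoint, provider, name = "count_status")

# Spread the provider levels into columns;
# fill = 0 covers combinations that never occur in the data
wide <- dcast(counts, region + timepoint ~ provider,
              value.var = "count_status", fill = 0)
print(wide)
```

Without fill = 0, the missing combinations would show up as NA rather than 0.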
This can be done with the dcast function from either the reshape2 or the data.table package:
library(reshape2)
dcast(mydf, paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)
library(data.table)
dcast(setDT(mydf), paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)
The output of the latter:
year.start Provider.Region Deregistered(V) Registered
1: 0010-05 NorthWest 2 0
2: 0010-06 London 1 3
3: 0010-06 SouthEast 1 0
4: 0010-06 WestMidlands 1 2
When you run the code above, you will get the following messages:
Using 'Provider.Status' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'
These do not affect the result, but to avoid them you can specify value.var and the aggregation function explicitly:
dcast(setDT(mydf),
paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status,
value.var = "Provider.Status", fun.aggregate = length)
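For anyone who wants to try this without the original data, here is a minimal reproducible sketch. It assumes the ten rows printed in the question, retyped by hand as character columns (the real data uses factors), and it builds the year-month key as a real column named `time` so the result does not inherit a header from the paste() expression. The names `dt`, `time` and `wide` are my own, not from the question.

```r
library(data.table)

# The ten rows shown in the question, rebuilt by hand (day.start left out)
entries <- data.frame(
  Provider.Region = c("North West", "North West", "West Midlands", "London",
                      "London", "London", "West Midlands", "London",
                      "West Midlands", "South East"),
  year.start  = "0010",
  month.start = c("05", "05", rep("06", 8)),
  Provider.Status = c("Deregistered (V)", "Deregistered (V)", "Registered",
                      "Registered", "Registered", "Registered",
                      "Deregistered (V)", "Deregistered (V)", "Registered",
                      "Deregistered (V)")
)

# Work on a data.table copy and add the year-month key as its own column
dt <- as.data.table(entries)
dt[, time := paste(year.start, month.start, sep = "-")]

# One row per time/region, one column per status level, cells hold row counts
wide <- dcast(dt, time + Provider.Region ~ Provider.Status,
              value.var = "Provider.Status", fun.aggregate = length)
print(wide)
```

On these ten rows West Midlands comes out as 1 deregistered and 2 registered, which agrees with the data.table output shown above.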