Grouping and counting to get a closerate
For each country, I want to count the number of times status is open and the number of times status is closed, and then calculate the closerate per country.
Data:
customer <- c(1,2,3,4,5,6,7,8,9)
country <- c('BE', 'NL', 'NL','NL','BE','NL','BE','BE','NL')
closeday <- c('2017-08-23', '2017-08-05', '2017-08-22', '2017-08-26',
'2017-08-25', '2017-08-13', '2017-08-30', '2017-08-05', '2017-08-23')
closeday <- as.Date(closeday)
df <- data.frame(customer,country,closeday)
Add status:
df$status <- ifelse(df$closeday < '2017-08-20', 'open', 'closed')
  customer country   closeday status
1        1      BE 2017-08-23 closed
2        2      NL 2017-08-05   open
3        3      NL 2017-08-22 closed
4        4      NL 2017-08-26 closed
5        5      BE 2017-08-25 closed
6        6      NL 2017-08-13   open
7        7      BE 2017-08-30 closed
8        8      BE 2017-08-05   open
9        9      NL 2017-08-23 closed
Calculate closerate:
closerate <- length(which(df$status == 'closed')) /
(length(which(df$status == 'closed')) + length(which(df$status == 'open')))
[1] 0.6666667
Obviously, this is the overall closerate. The challenge is to get the closerate per country. I tried adding the closerate calculation to df with:
df$closerate <- length(which(df$status == 'closed')) /
(length(which(df$status == 'closed')) + length(which(df$status == 'open')))
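But this gives every row a closerate of 0.66, because I am not grouping. I believe I should not be using the length function at all, since the counting can be done by grouping. I read about using dplyr to count logical outputs per group, but could not get it to work.
This is the desired output: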
You can use tapply:
data.frame(open=tapply(df$status=="open", df$country, sum),
           closed=tapply(df$status=="closed", df$country, sum),
           closerate=tapply(df$status=="closed", df$country, mean))
aggregate(list(output = df$status == "closed"),
list(country = df$country),
function(x)
c(close = sum(x),
open = length(x) - sum(x),
rate = mean(x)))
# country output.close output.open output.rate
#1 BE 3.00 1.00 0.75
#2 NL 3.00 2.00 0.60
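If the goal is to keep one row per customer and attach the per-country rate as a new column, which is what the assignment in the question was attempting, base R's ave() is one option. A minimal sketch, rebuilding the example data so that it runs on its own (the column name closerate follows the question):

```r
# Rebuild the example data from the question
customer <- c(1,2,3,4,5,6,7,8,9)
country  <- c('BE', 'NL', 'NL','NL','BE','NL','BE','BE','NL')
closeday <- as.Date(c('2017-08-23', '2017-08-05', '2017-08-22', '2017-08-26',
                      '2017-08-25', '2017-08-13', '2017-08-30', '2017-08-05',
                      '2017-08-23'))
df <- data.frame(customer, country, closeday)
df$status <- ifelse(df$closeday < '2017-08-20', 'open', 'closed')

# ave() applies FUN within each country and returns a vector as long as
# the input, so every row receives the rate of its own country
df$closerate <- ave(as.numeric(df$status == "closed"), df$country, FUN = mean)
df
```

Unlike the aggregate() call above, this does not collapse the data to one row per country.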
The solution using table in the comments seems to have been deleted. Anyway, you could also use table:
output = as.data.frame.matrix(table(df$country, df$status))
output$closerate = output$closed/(output$closed + output$open)
output
# closed open closerate
#BE 3 1 0.75
#NL 3 2 0.60
A data.table approach would be:
library(data.table)
setDT(df)[, {temp <- status=="closed"; # store temporary logical variable
.(closed=sum(temp), open=sum(!temp), closeRate=mean(temp))}, # calculate stuff
by=country] # by country
which returns
   country closed open closeRate
1:      BE      3    1      0.75
2:      NL      3    2      0.60
Here is a dplyr solution.
library(dplyr)

output <- df %>%
  count(country, status) %>%            # rows per country/status pair
  group_by(country) %>%
  mutate(total = sum(n)) %>%            # rows per country
  mutate(percent = n/total)
Returns...
output
  country status n total percent
  BE      closed 3     4    0.75
  BE      open   1     4    0.25
  NL      closed 3     5    0.60
  NL      open   2     5    0.40
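The result above is in long format, with one row per country and status. If a single row per country is preferred, a variant using summarise() could look like the sketch below. It rebuilds the example data so that it runs on its own, and the name summary_tbl is only illustrative:

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  customer = 1:9,
  country  = c('BE', 'NL', 'NL', 'NL', 'BE', 'NL', 'BE', 'BE', 'NL'),
  closeday = as.Date(c('2017-08-23', '2017-08-05', '2017-08-22', '2017-08-26',
                       '2017-08-25', '2017-08-13', '2017-08-30', '2017-08-05',
                       '2017-08-23'))
)
df$status <- ifelse(df$closeday < '2017-08-20', 'open', 'closed')

# One row per country: counts of each status plus the share that is closed
summary_tbl <- df %>%
  group_by(country) %>%
  summarise(open      = sum(status == "open"),
            closed    = sum(status == "closed"),
            closerate = mean(status == "closed"))
summary_tbl
```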
Here is a quick solution using the tidyverse:
library(dplyr)
df %>% group_by(country) %>%
  mutate(status = ifelse(closeday < '2017-08-20', 'open', 'closed'),
         closerate = mean(status == "closed"))
Returns:
# A tibble: 9 x 5
# Groups:   country [2]
  customer country closeday   status closerate
     <dbl> <fctr>  <date>     <chr>      <dbl>
1        1 BE      2017-08-23 closed      0.75
2        2 NL      2017-08-05 open        0.60
3        3 NL      2017-08-22 closed      0.60
4        4 NL      2017-08-26 closed      0.60
5        5 BE      2017-08-25 closed      0.75
6        6 NL      2017-08-13 open        0.60
7        7 BE      2017-08-30 closed      0.75
8        8 BE      2017-08-05 open        0.75
9        9 NL      2017-08-23 closed      0.60
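Here I am taking advantage of the fact that a vector of TRUE/FALSE values is coerced to numbers (1/0) when it is passed to mean(), so the mean is the proportion of TRUE values.
Or, with data.table: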
library(data.table)
setDT(df)[,status:=ifelse(closeday < '2017-08-20', 'open', 'closed')]
df[, .(closerate=mean(status=="closed")), by=country]
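The coercion that these answers rely on can be checked in isolation with base R. In this small sketch the status values are made up for illustration and are not taken from the question's data:

```r
# Hypothetical status values, only to illustrate the coercion
status <- c("closed", "open", "closed", "closed")

is_closed <- status == "closed"   # logical vector: TRUE FALSE TRUE TRUE

n_closed   <- sum(is_closed)      # TRUE counts as 1, so this counts the TRUEs
rate_close <- mean(is_closed)     # proportion of TRUE values

n_closed
rate_close
```

Because sum() and mean() both treat TRUE as 1 and FALSE as 0, no explicit length(which(...)) is needed to count or to compute a rate.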