Tidyverse syntax for calculating precision and recall
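I'm trying to calculate AUC, precision, recall, and accuracy for each group in my data frame (I have a single data frame that contains prediction data from three different models).
What is the tidyverse syntax for doing this? I'd like to use Max Kuhn's yardstick package to compute these metrics.
Here is a sample df, and this is what I have so far: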
> library(tidyverse)
> library(yardstick)
>
> sample_df <- tibble(
+ group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
+ true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
+ pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
+ ) %>%
+ mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
>
> sample_df
# A tibble: 15 x 4
   group_type true_label pred_prob pred_label
   <chr>      <fct>          <dbl> <fct>
 1 a          1             0.327  0
 2 a          1             0.286  0
 3 a          0             0.0662 0
 4 a          0             0.993  1
 5 a          0             0.835  1
 6 b          0             0.975  1
 7 b          0             0.436  0
 8 b          0             0.585  1
 9 b          0             0.478  0
10 b          1             0.541  1
11 c          1             0.247  0
12 c          0             0.608  1
13 c          0             0.215  0
14 c          0             0.937  1
15 c          0             0.819  1
>
Metrics:
> # metrics for the full data
> precision(sample_df, truth = true_label, estimate = pred_label)
[1] 0.5714286
> recall(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3636364
> accuracy(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3333333
> roc_auc(sample_df, truth = true_label, pred_prob)
[1] 0.7727273
>
Now, how do I get these metrics for each group in my dataset?
sample_df %>%
group_by(group_type) %>%
summarize(???)
I managed to do this by splitting the data frame into a list and mapping the function over each list element:
library(tidyverse)
library(yardstick)
sample_df %>%
split(.$group_type) %>%
map_dfr(precision, true_label, pred_label)
#output
## A tibble: 1 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.500 0.667  1.00
The yardstick functions don't seem to support group_by yet.
This also works:
sample_df %>%
split(.$group_type) %>%
map_dfr(function(x){
prec = precision(x, true_label, pred_label)
rec = recall(x, true_label, pred_label)
return(data.frame(prec, rec))
})
An example using unnest:
sample_df %>%
group_by(group_type) %>%
do(auc = roc_auc(., true_label, pred_prob),
acc = accuracy(., true_label, pred_label),
recall = recall(., true_label, pred_label),
precision = precision(., true_label, pred_label)) %>% unnest
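However, I'd actually recommend against using yardstick here, because it doesn't play well with dplyr's summarize. In practice, for the ROC metrics it just wraps the pROC package under the hood. I would simply write your own functions, each taking two variables.
yardstick is flawed in that it requires a data.frame as its first input; it's trying to be too clever. Within the dplyr framework that isn't necessary, because functions called inside summarize and mutate can already see the variables in the data.frame without an explicit data argument.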
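To illustrate the "write your own two-variable functions" suggestion, here is a minimal sketch. The helpers precision_2v and recall_2v, the toy_df data, and the choice of "1" as the positive label are all assumptions made up for this example; none of them come from yardstick.

```r
library(dplyr)

# Hypothetical helpers (not part of yardstick): each takes two vectors.
# `positive` names the label treated as the event; "1" is an assumption.
precision_2v <- function(truth, estimate, positive = "1") {
  predicted_pos <- estimate == positive
  sum(predicted_pos & truth == positive) / sum(predicted_pos)
}

recall_2v <- function(truth, estimate, positive = "1") {
  actual_pos <- truth == positive
  sum(actual_pos & estimate == positive) / sum(actual_pos)
}

# Small hand-made data set so the results can be checked by eye
toy_df <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 1, 0, 0, 1, 0, 0, 1)),
  pred_label = factor(c(1, 0, 1, 0, 1, 1, 0, 1))
)

# Inside summarise() the helpers see the columns of each group directly
per_group <- toy_df %>%
  group_by(group_type) %>%
  summarise(
    precision = precision_2v(true_label, pred_label),
    recall    = recall_2v(true_label, pred_label),
    accuracy  = mean(as.character(true_label) == as.character(pred_label))
  )

per_group
```

Because the helpers take plain vectors, they need no data argument. One caveat: a group with no predicted positives yields NaN for precision, which a real implementation would probably want to handle.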
I used the example from http://r4ds.had.co.nz/many-models.html
It uses nesting, and it also computes precision as you asked.
library(tidyverse)
library(yardstick)
sample_df <- tibble(group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
) %>%
mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
by_group_type <- sample_df %>% group_by(group_type) %>% nest()
stick_m_1 <- function(df){
precision(df,truth = true_label, estimate = pred_label)
}
models <- map(by_group_type$data,stick_m_1)
models
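The list returned by map() can also be folded back into one row per group. This is a minimal sketch that assumes a yardstick release providing the vector interface precision_vec(). The toy_df data is made up for illustration, and "1" is placed first in the factor levels because yardstick treats the first level as the event by default.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(yardstick)

# Made-up illustration data; "1" is the first factor level so that
# yardstick's default (first level = event) treats it as the positive class
toy_df <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 1, 0, 0, 1, 0, 0, 1), levels = c(1, 0)),
  pred_label = factor(c(1, 0, 1, 0, 1, 1, 0, 1), levels = c(1, 0))
)

precision_by_group <- toy_df %>%
  group_by(group_type) %>%
  nest() %>%
  ungroup() %>%
  # one number per nested data frame, stored as an ordinary column
  mutate(precision = map_dbl(data, ~ precision_vec(.x$true_label, .x$pred_label))) %>%
  select(group_type, precision)

precision_by_group
```

Using map_dbl() rather than map() keeps the result as a plain numeric column next to group_type, so it can be joined or plotted without further unpacking.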
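As others have pointed out, the functions in yardstick don't play nicely with grouped data frames (at least for now). One workaround is to use nested data.
To reduce duplication, it may also be a good idea to write a simple wrapper function that computes all the summary metrics you need in a single call. Here's an example of how you could do that: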
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-09
First, the setup:
library(tidyverse)
library(yardstick)
set.seed(1)
# Given sample data
sample_df <- tibble(
group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
) %>%
mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
#> Warning: package 'bindrcpp' was built under R version 3.3.3
Here's the wrapper:
# Wrapper to calculate several metrics from same data
performance_metrics <- function(data, truth, estimate, prob) {
  metrics <- lst(precision, recall, accuracy) # these all share arguments
  values <- invoke_map_df(metrics, list(list(data)), truth, estimate)
  roc <- roc_auc(data, truth, prob) # bit different here; uses the group's `data`, not the global sample_df
  bind_cols(values, roc_auc = roc)
}
# Wrap the wrapper with default arguments
metrics <- partial(performance_metrics,
truth = "true_label",
estimate = "pred_label",
prob = "pred_prob")
And apply it to the groups via nested data:
sample_df %>%
nest(-group_type) %>%
mutate(metrics = map(data, metrics)) %>%
unnest(metrics)
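#> # A tibble: 3 x 6
#>   group_type data             precision    recall accuracy   roc_auc
#>   <chr>      <list>               <dbl>     <dbl>    <dbl>     <dbl>
#> 1 a          <tibble [5 x 3]> 0.5000000 0.2500000      0.2 0.5909091
#> 2 b          <tibble [5 x 3]> 0.6666667 0.6666667      0.6 0.5909091
#> 3 c          <tibble [5 x 3]> 0.7500000 0.7500000      0.6 0.5909091
Note that roc_auc is the same in every row of this output, which is what results when roc_auc is evaluated on the full sample_df rather than on each group's data; per-group AUC values will generally differ.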
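For what it's worth, yardstick releases newer than the one used in this thread accept grouped data frames directly, so the nesting workaround may not be needed. The sketch below assumes such a release (one that provides metric_set() and returns tibbles with an .estimate column). The toy_df data is made up, and "1" is the first factor level so that the default event level matches the positive class.

```r
library(dplyr)
library(yardstick)

# Made-up illustration data; pred_prob is the predicted probability of "1"
toy_df <- tibble(
  group_type = rep(c("a", "b"), each = 4),
  true_label = factor(c(1, 1, 0, 0, 1, 0, 0, 1), levels = c(1, 0)),
  pred_prob  = c(0.9, 0.4, 0.6, 0.1, 0.8, 0.55, 0.2, 0.6)
) %>%
  mutate(pred_label = factor(if_else(pred_prob > 0.5, 1, 0), levels = c(1, 0)))

# Bundle the class metrics that share the truth/estimate arguments
class_metrics <- metric_set(precision, recall, accuracy)

grouped <- toy_df %>% group_by(group_type)

# One row per group and metric
per_group_class <- grouped %>%
  class_metrics(truth = true_label, estimate = pred_label)

# roc_auc takes the probability column instead of the hard labels
per_group_auc <- grouped %>%
  roc_auc(truth = true_label, pred_prob)

per_group_class
per_group_auc
```

The results come back in long form (one row per group and metric); tidyr's pivot_wider() can spread them into one column per metric if a wide table like the one above is preferred.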