Creating custom quantiles within a data frame?
If I have the following table:
tibble(year = c("2020", "2020", "2020","2021", "2021", "2021"),
website = c("facebook", "google", "youtube","facebook", "google", "youtube"),
method = c("laptop", "laptop", "laptop", "mobile", "mobile", "mobile"),
values = c(10,30,60, 90,25, 40))
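How can I create a column based on custom quantiles of the numbers in the values column?
For example, suppose I have the following custom quantile conditions:
Risky - > 50%
Neither - 25-50%
Safe - < 25%
In other words, for the numbers in the values column, work out which quantile band each one falls into and assign it the corresponding rank value of 1, 2, or 3.
The final table should look like this: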
tibble(year = c("2020", "2020", "2020","2021", "2021", "2021"),
website = c("facebook", "google", "youtube","facebook", "google", "youtube"),
method = c("laptop", "laptop", "laptop", "mobile", "mobile", "mobile"),
values = c(10,30,60, 90,25, 40),
rank = c(3,2,1,1,3,2))
I know the table has to be grouped by year and method, so the code would look something like this:
df %>% group_by(year, method) %>% mutate(rank = quantile(???))
You can use the ntile function from dplyr to create quantiles:
library(dplyr)
df %>%
group_by(year, method) %>%
mutate(rank = ntile(values, 4))
Output:
# A tibble: 6 × 5
# Groups: year, method [2]
year website method values rank
<chr> <chr> <chr> <dbl> <int>
1 2020 facebook laptop 10 1
2 2020 google laptop 30 2
3 2020 youtube laptop 60 3
4 2021 facebook mobile 90 3
5 2021 google mobile 25 1
6 2021 youtube mobile 40 2
df %>%
group_by(year, method) %>%
mutate(rank = rank(-cut(values, breaks = c(-Inf, quantile(values, probs = c(0.25, 0.50), names = FALSE), Inf), labels = FALSE)))
# # A tibble: 6 x 5
# # Groups: year, method [2]
# year website method values rank
# <chr> <chr> <chr> <dbl> <dbl>
# 1 2020 facebook laptop 10 3
# 2 2020 google laptop 30 2
# 3 2020 youtube laptop 60 1
# 4 2021 facebook mobile 90 1
# 5 2021 google mobile 25 3
# 6 2021 youtube mobile 40 2
You can use quantile(x, c(0.25, 0.5)) to get the cut points and pass them to findInterval(). Note that findInterval() is similar to cut(*, labels = FALSE) but more efficient.
library(dplyr)
df %>%
group_by(year, method) %>%
mutate(rank = findInterval(-values, quantile(-values, c(0.25, 0.5)), left.open = TRUE) + 1) %>%
ungroup()
# # A tibble: 6 × 5
# year website method values rank
# <chr> <chr> <chr> <dbl> <dbl>
# 1 2020 facebook laptop 10 3
# 2 2020 google laptop 30 2
# 3 2020 youtube laptop 60 1
# 4 2021 facebook mobile 90 1
# 5 2021 google mobile 25 3
# 6 2021 youtube mobile 40 2
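As a quick check of the note above that findInterval() behaves like cut(*, labels = FALSE), here is a minimal base R sketch that compares the two on the values column, treated as a single ungrouped vector. The names x, breaks, via_find and via_cut are illustrative only. It assumes left.open = TRUE, so that both functions use intervals that are open on the left and closed on the right.

```r
# Values column from the question, taken as one ungrouped vector
x <- c(10, 30, 60, 90, 25, 40)

# Cut points at the 25% and 50% quantiles (names dropped for cleaner output)
breaks <- quantile(x, c(0.25, 0.5), names = FALSE)

# findInterval() returns 0 for values at or below the first break, so add 1
via_find <- findInterval(x, breaks, left.open = TRUE) + 1

# cut() needs explicit outer breaks to cover the whole range
via_cut <- cut(x, breaks = c(-Inf, breaks, Inf), labels = FALSE)

# TRUE when both approaches assign the same bin to every value
all.equal(via_find, as.numeric(via_cut))
```

Neither call here negates the values, so bin 1 is the lowest band. The answer above negates values to reverse the direction of the ranking.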
If you want labels instead of ranks, use cut():
df %>%
group_by(year, method) %>%
mutate(rank = cut(values, quantile(values, c(0, 0.25, 0.5, 1)),
c("Safe", "Neither", "Risky"), include.lowest = TRUE)) %>%
ungroup()
# # A tibble: 6 × 5
# year website method values rank
# <chr> <chr> <chr> <dbl> <fct>
# 1 2020 facebook laptop 10 Safe
# 2 2020 google laptop 30 Neither
# 3 2020 youtube laptop 60 Risky
# 4 2021 facebook mobile 90 Risky
# 5 2021 google mobile 25 Safe
# 6 2021 youtube mobile 40 Neither
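To see where the labels in the output above come from, it may help to work through one group by hand. The sketch below uses only the 2020/laptop values. The names laptop_2020, cuts and band are made up for illustration, and quantile() is assumed to use its default interpolation (type 7).

```r
# Values for the 2020 / laptop group only
laptop_2020 <- c(10, 30, 60)

# Break points at the 0%, 25%, 50% and 100% quantiles of this group
cuts <- quantile(laptop_2020, c(0, 0.25, 0.5, 1), names = FALSE)
print(cuts)

# Assign each value to a labelled band; include.lowest = TRUE keeps the
# group minimum inside the first interval instead of turning it into NA
band <- cut(laptop_2020, cuts, c("Safe", "Neither", "Risky"),
            include.lowest = TRUE)
print(as.character(band))
```

One caveat: cut() stops with an error when the break points are not unique, which can happen in groups with many tied values. In that case the findInterval() approach may be easier to work with.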
A {santoku} one-liner:
mutate(df,
# chop the values column (the original passed rank, which does not exist in df yet)
rank = santoku::chop_quantiles(values, c(0.25, 0.5),
labels = c("Safe", "Neither", "Risky"))
)