How to use case_when when doing quartiles in R?
Suppose I have this tibble:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2011END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
website = c(
"google",
"google",
"facebook",
"facebook",
"youtube"
),
method = c("website",
"phone",
"website",
"laptop",
"phone"),
values = c(1, NA, 1, 2, 3))
Then I have this data frame, which tells you which quantiles to create and which rank to assign based on them:
tibble(
method = c(
"phone",
"phone",
"phone",
"website",
"website",
"website",
"laptop",
"laptop",
"laptop"
),
rank = c(3,2,1,3,2,1,3,2,1),
tile_condition = c("lowest 25%", "25 to 50%", "more than 50%",
"highest 25%", "25 to 50%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")
)
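How can I use a case_when statement to create a rank column based on quartiles computed from the values column of the first data frame?
I am trying to apply the quantiles from the other data frame to create a rank column in the original data frame, and have been looking into how to do this with case_when.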
If I understand your question correctly, you first have to create a table to compare against, like:
library(dplyr)
library(tidyr)  # for drop_na()

# Per-method quartile thresholds of values, ignoring missing values
df_quants <-
  df1 %>%
  drop_na(values) %>%
  group_by(method) %>%
  summarize(quant25 = quantile(values, probs = 0.25),
            quant50 = quantile(values, probs = 0.5),
            quant75 = quantile(values, probs = 0.75),
            quant100 = quantile(values, probs = 1))
Then, using a join and a case_when statement, you get:
df2 %>%
  left_join(df_quants, by = 'method') %>%
  mutate(tiles =
           case_when(rank < quant25 ~ 'lowest 25%',
                     rank < quant50 ~ '25 to 50%',
                     rank < quant75 ~ 'more than 50%',
                     rank >= quant75 ~ 'highest 25%'))
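As a variation on the idea above, the thresholds can also be compared against `values` itself rather than against `rank`. The following is only a minimal sketch under stated assumptions: `toy` is hypothetical data with eight values per method (so that every quartile is populated), and the column name `quartile` is made up for illustration. It is not the asker's data or a definitive solution.

```r
library(dplyr)

# Hypothetical toy data: eight values per method so each quartile gets rows
toy <- tibble(
  method = rep(c("phone", "website"), each = 8),
  values = c(1:8, seq(10, 80, by = 10))
)

ranked <- toy %>%
  group_by(method) %>%
  mutate(
    # Per-method thresholds; names = FALSE drops the "25%"-style names
    q25 = quantile(values, 0.25, na.rm = TRUE, names = FALSE),
    q50 = quantile(values, 0.50, na.rm = TRUE, names = FALSE),
    q75 = quantile(values, 0.75, na.rm = TRUE, names = FALSE),
    # case_when checks conditions top to bottom and keeps the first match
    quartile = case_when(
      is.na(values) ~ NA_integer_,
      values <= q25 ~ 1L,
      values <= q50 ~ 2L,
      values <= q75 ~ 3L,
      TRUE          ~ 4L
    )
  ) %>%
  ungroup() %>%
  select(-q25, -q50, -q75)
```

The resulting quartile number could then be joined to a lookup table keyed by `method` and quartile to pick up the `rank` and `tile_condition` columns.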
I would do it like this:
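set.seed(124)
left_join(
  # Expand df1 to 1000 rows with random values, then number each value's quartile (1 to 4) within its method
  df1[sample(1:5, 1000, replace = TRUE), ] %>%
    mutate(values = sample(c(df1$values, 1:30), 1000, replace = TRUE)) %>%
    group_by(method) %>%
    mutate(q = as.double(cut(values, quantile(values, probs = seq(0, 1, 0.25), na.rm = TRUE), labels = c(1:4), include.lowest = TRUE))) %>%
    ungroup(),
  # List the quartile numbers each df2 row covers, then expand to one row per quartile
  df2 %>% mutate(q = list(1, 2, c(3, 4), 4, c(2, 3), 1, 1, 2, c(3, 4))) %>% unnest(q),
  by = c("method", "q")
) %>% select(-q)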
Output:
# A tibble: 1,000 × 7
   period  date       website  method  values  rank tile_condition
   <chr>   <chr>      <chr>    <chr>    <dbl> <dbl> <chr>
 1 2010END 31-12-2010 google   website      7     2 25 to 75%
 2 2011END 31-12-2011 google   phone       18     1 more than 50%
 3 2010Q1  30-04-2010 facebook website     21     2 25 to 75%
 4 2011END 30-09-2010 youtube  phone       15     1 more than 50%
 5 2011END 30-09-2010 youtube  phone       26     1 more than 50%
 6 2011END 31-12-2011 google   phone        3     3 lowest 25%
 7 2010END 31-12-2010 google   website      1     1 less than 25%
 8 2010Q1  30-04-2010 facebook website      2     1 less than 25%
 9 2010Q2  31-07-2010 facebook laptop      14     2 25 to 50%
10 2010Q2  31-07-2010 facebook laptop      16     1 more than 50%
# … with 990 more rows
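Note that, for illustration purposes, I expanded your input to 1000 rows with random new values. Also note that I fixed df2 so that the method website covers the whole range of values. In your example, the 50% to 75% quartile was missing.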
Adjusted df2 input:
structure(list(method = c("phone", "phone", "phone", "website",
"website", "website", "laptop", "laptop", "laptop"), rank = c(3,
2, 1, 3, 2, 1, 3, 2, 1), tile_condition = c("lowest 25%", "25 to 50%",
"more than 50%", "highest 25%", "25 to 75%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))
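To make the list-column step easier to follow, here is a small sketch of what `mutate(q = list(...)) %>% unnest(q)` does to the adjusted df2: each row is given the quartile numbers it covers and is then expanded to one row per quartile, so the join on `method` and `q` can find exactly one match. The name `lookup` is hypothetical; the mapping itself is copied from the answer above.

```r
library(dplyr)
library(tidyr)

# Adjusted df2 from the answer, rebuilt as a plain tibble
df2 <- tibble(
  method = rep(c("phone", "website", "laptop"), each = 3),
  rank = rep(c(3, 2, 1), 3),
  tile_condition = c("lowest 25%", "25 to 50%", "more than 50%",
                     "highest 25%", "25 to 75%", "less than 25%",
                     "lowest 25%", "25 to 50%", "more than 50%")
)

# One list element per df2 row: the quartile numbers (1 to 4) that row covers.
# unnest() then repeats a row once for each quartile in its list element.
lookup <- df2 %>%
  mutate(q = list(1, 2, c(3, 4), 4, c(2, 3), 1, 1, 2, c(3, 4))) %>%
  unnest(q)
```

After unnesting, every method has one row for each of the quartiles 1 to 4, which is why the "25 to 75%" label had to replace "25 to 50%" for website.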
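Here is a quick version. It does not give you the exact labels you want. For that, you would have to parse the tile_condition column, which is a bit tricky.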
library(dplyr)
library(santoku)
# df is the first data frame from the question
df |>
  group_by(method) |>
  mutate(
    quantile = chop_quantiles(values, c(0.25, 0.50),
      labels = c("Lowest 25%", "25 to 50%", "Above 50%"), extend = TRUE)
  )
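For readers who prefer not to add a dependency, roughly the same binning can be sketched in base R with `cut()`. This is an approximation under assumptions, not a drop-in replacement for santoku: the helper name `label_quantiles` is hypothetical, intervals are made left-closed via `right = FALSE`, and `cut()` stops with an error if the two quantiles coincide, which can happen in very small groups.

```r
# Hypothetical helper: bin a numeric vector at its 25% and 50% quantiles
label_quantiles <- function(x) {
  breaks <- quantile(x, probs = c(0.25, 0.50), na.rm = TRUE, names = FALSE)
  cut(
    x,
    breaks = c(-Inf, breaks, Inf),
    labels = c("Lowest 25%", "25 to 50%", "Above 50%"),
    right = FALSE  # left-closed intervals: [a, b)
  )
}

# Example on a small vector; inside a pipeline this could be called
# from mutate() after group_by(method)
binned <- label_quantiles(1:8)
```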