How to use case_when when doing quartiles in R?

Suppose I have this data frame (referred to as df1 in the answers below):

df1 <- tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2011END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "youtube"
  ),
  method = c("website",
             "phone",
             "website",
             "laptop",
             "phone"),
  values = c(1, NA, 1, 2, 3))

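Then I have this data frame (df2 in the answers below), which says which quantiles to create and which rank to assign to each of them:

df2 <- tibble(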
  method = c(
    "phone",
    "phone",
    "phone",
    "website",
    "website",
    "website",
    "laptop",
    "laptop",
    "laptop"
  ), 
  rank = c(3,2,1,3,2,1,3,2,1), 
  tile_condition = c("lowest 25%", "25 to 50%", "more than 50%", 
                     "highest 25%", "25 to 50%", "less than 25%", 
                     "lowest 25%", "25 to 50%", "more than 50%")
)

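How can I use a case_when statement to create a rank column based on quartiles calculated from the values column of the first data frame?

I am trying to apply the quantiles from the other data frame to create a rank column in the original data frame, and have been looking into how to use case_when.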

If I understood your question correctly, you first have to create a table of quantiles to compare against, like this:

library(dplyr)
library(tidyr)

# Per-method quantiles of values, ignoring rows where values is NA
df_quants <- 
    df1 %>% 
    drop_na(values) %>% 
    group_by(method) %>% 
    summarize(quant25 = quantile(values, probs = 0.25), 
              quant50 = quantile(values, probs = 0.5), 
              quant75 = quantile(values, probs = 0.75), 
              quant100 = quantile(values, probs = 1))

Then, using a join and a case_when statement, you get:

df2 %>% 
    left_join(df_quants, by = 'method') %>% 
    mutate(tiles = 
        case_when(rank < quant25 ~ 'lowest 25%', 
                  rank < quant50 ~ '25 to 50%', 
                  rank < quant75 ~ 'more than 50%', 
                  rank >= quant75 ~ 'highest 25%'))
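
One possible reading of the question is that the values column of df1 itself should be binned, rather than the rank column of df2. The following is a minimal sketch under that assumption. The tile column name and the label strings are hypothetical, and df1 is reduced to the two columns the sketch needs.

```r
library(dplyr)
library(tidyr)

# The question's first data frame, reduced to the columns used here.
df1 <- tibble(
  method = c("website", "phone", "website", "laptop", "phone"),
  values = c(1, NA, 1, 2, 3)
)

# Per-method quartile cut points, ignoring missing values.
df_quants <- df1 %>%
  drop_na(values) %>%
  group_by(method) %>%
  summarize(
    quant25 = quantile(values, probs = 0.25),
    quant50 = quantile(values, probs = 0.50),
    quant75 = quantile(values, probs = 0.75)
  )

# case_when checks conditions in order and uses the first one that is TRUE.
# The labels below are illustrative assumptions, not taken from df2.
df1_tiled <- df1 %>%
  left_join(df_quants, by = "method") %>%
  mutate(
    tile = case_when(
      is.na(values)     ~ NA_character_,
      values <= quant25 ~ "lowest 25%",
      values <= quant50 ~ "25 to 50%",
      values <= quant75 ~ "50 to 75%",
      TRUE              ~ "highest 25%"
    )
  ) %>%
  select(method, values, tile)
```

With only five rows, each method has a single distinct non-missing value, so all of its quantiles coincide and every non-missing row lands in the first bin. The approach only becomes informative on a larger sample.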

I would do it like this:

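set.seed(124)

left_join(
  # Resample df1 to 1000 rows with random values, then bin values into quartiles per method
  df1[sample(1:5, 1000, replace = TRUE), ] %>% 
    mutate(values = sample(c(df1$values, 1:30), 1000, replace = TRUE)) %>% 
    group_by(method) %>% 
    mutate(q = as.double(cut(values,
                             quantile(values, probs = seq(0, 1, 0.25), na.rm = TRUE),
                             labels = c(1:4), include.lowest = TRUE))) %>% 
    ungroup(),
  # Map each df2 row to the quartile bin(s) that its tile_condition covers
  df2 %>% mutate(q = list(1, 2, c(3, 4), 4, c(2, 3), 1, 1, 2, c(3, 4))) %>% unnest(q),
  by = c("method", "q")
) %>% select(-q)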

Output:

# A tibble: 1,000 × 7
   period  date       website  method  values  rank tile_condition
   <chr>   <chr>      <chr>    <chr>    <dbl> <dbl> <chr>         
 1 2010END 31-12-2010 google   website      7     2 25 to 75%     
 2 2011END 31-12-2011 google   phone       18     1 more than 50% 
 3 2010Q1  30-04-2010 facebook website     21     2 25 to 75%     
 4 2011END 30-09-2010 youtube  phone       15     1 more than 50% 
 5 2011END 30-09-2010 youtube  phone       26     1 more than 50% 
 6 2011END 31-12-2011 google   phone        3     3 lowest 25%    
 7 2010END 31-12-2010 google   website      1     1 less than 25% 
 8 2010Q1  30-04-2010 facebook website      2     1 less than 25% 
 9 2010Q2  31-07-2010 facebook laptop      14     2 25 to 50%     
10 2010Q2  31-07-2010 facebook laptop      16     1 more than 50% 
# … with 990 more rows

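Note that, for illustration purposes, I expanded your input to 1000 rows with new random values. Also note that I fixed df2 so that the website method covers the entire range of values. In your example, the 50% to 75% quartile was missing.

Adjusted df2 input: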

structure(list(method = c("phone", "phone", "phone", "website", 
"website", "website", "laptop", "laptop", "laptop"), rank = c(3, 
2, 1, 3, 2, 1, 3, 2, 1), tile_condition = c("lowest 25%", "25 to 50%", 
"more than 50%", "highest 25%", "25 to 75%", "less than 25%", 
"lowest 25%", "25 to 50%", "more than 50%")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L))
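
The binning step in this answer (cut() over quantile() break points, then a lookup from bin number to label) can be seen in isolation in the base R sketch below. The sample vector and the lookup table are hypothetical; the labels mirror the phone rows of df2, where "more than 50%" spans bins 3 and 4.

```r
# Hypothetical sample values, not taken from the question.
values <- c(1, 2, 3, 4, 5, 6, 7, 8, NA)

# Quartile break points at 0%, 25%, 50%, 75% and 100%.
breaks <- quantile(values, probs = seq(0, 1, 0.25), na.rm = TRUE)

# cut() assigns each value to a bin; include.lowest = TRUE keeps the
# minimum in bin 1. Converting the factor to integer yields the bin number.
q <- as.integer(cut(values, breaks = breaks, labels = 1:4, include.lowest = TRUE))

# Lookup from bin number to label; one label may cover several bins.
lookup <- data.frame(
  q = 1:4,
  tile_condition = c("lowest 25%", "25 to 50%", "more than 50%", "more than 50%"),
  stringsAsFactors = FALSE
)

# match() returns NA for a missing bin, so NA values stay NA.
tile <- lookup$tile_condition[match(q, lookup$q)]
```

One caveat: cut() stops with an error when the break points are not unique, which can happen on small or heavily tied groups such as the original five-row df1.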

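Here is a quick version. It does not give you the exact labels you want. For that, you would have to parse the tile_condition column, which is a bit tricky.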

library(dplyr)
library(santoku)

# df1 is the first data frame from the question
df1 |> 
  group_by(method) |>
  mutate(
    quantile = chop_quantiles(values, c(0.25, 0.50), 
      labels = c("Lowest 25%", "25 to 50%", "Above 50%"), extend = TRUE)
  )