Finding if the values in one column are within the range of several other columns

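I am looking for a simple way to determine whether the values in one column fall within the range of the values in several other columns.

My input looks like this: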

ID  "Q1 Comm - 01 Scope Thesis" "Q1 Comm - 02 Scope Project" "Q1 Comm - 03 Learn Intern"    "Q1 Comm - 04 Biography"    "Q1 Comm - Overall Plan"
10   NA                          NA                           4                              NA      4
31   2                           NA                           NA                             NA      2
225  0                           NA                           NA                             NA      1
243  NA                          2                            NA                             1       0
310  NA                          2                            NA                             1       NA

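For each unique ID, I want to determine when the column Q1 Comm - Overall Plan is:

1 - Below the min() of all the other columns, or

2 - Above the max() of all the other columns, or

3 - Within the range of all the other columns

The full list of columns (including the overall column) is: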

"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Plan"

The output I need looks like this:

ID  "Q1 Comm - 01 Scope Thesis" "Q1 Comm - 02 Scope Project" "Q1 Comm - 03 Learn Intern"    "Q1 Comm - 04 Biography"    "Q1 Comm - Overall Plan" "Q1_check"
10   NA                          NA                           4                              NA      4 "within"
31   2                           NA                           NA                             NA      2 "within"
225  0                           NA                           NA                             NA      1 "above"
243  NA                          2                            NA                             1       0 "below"
310  NA                          2                            NA                             1       NA NA

The dput() of my data frame df is below.

dput(df)

structure(list(ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA, 
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L), 
    Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA, 
    NA, NA, 1L), Q1.Comm...Overall.Plan = c(4L, 1L, 2L, 
    NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA, 
    NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))

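Note:

I asked this question before here, Finding if a value is within the range of other columns, but the example was too simple and none of the solutions worked for me.

That question got too long, so for clarity I am posting this as a new question.

Thank you for taking the time to help with this post.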

You can try something like this with rowwise and c_across:

library(dplyr)
df %>%
  rowwise %>%
  summarise(ID = ID,
            Max = `Q1.Comm...Overall.Plan` > max(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE),
            Min = `Q1.Comm...Overall.Plan` < min(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE),
            Range = `Q1.Comm...Overall.Plan` >= range(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE)[1] &
                    `Q1.Comm...Overall.Plan` <= range(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE)[2]) %>%
  mutate(Result = case_when(Max ~ "above",
                            Min ~ "below",
                            Range ~ "within",
                            TRUE ~ NA_character_))
# A tibble: 4 x 5
     ID Max   Min   Range Result
  <int> <lgl> <lgl> <lgl> <chr> 
1    10 FALSE FALSE TRUE  within
2    31 FALSE FALSE TRUE  within
3   225 TRUE  FALSE FALSE above 
4   243 NA    NA    NA    NA    

You can change summarise to mutate to keep the original columns, and/or use select to drop them.

For details, see the dplyr rowwise tutorial.
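As a minimal sketch of the mutate variant mentioned above, the following keeps the original columns and adds a single Q1_check column. The data frame is a hand-typed copy of the dput values without the empty X columns, and the helper columns item_min and item_max as well as the matches() pattern are assumptions for illustration, not part of the original answer.

```r
library(dplyr)

# Hand-typed copy of the question's dput (the empty X columns are left out)
df <- data.frame(
  ID = c(10L, 31L, 225L, 243L),
  Q1.Comm...01.Scope.Thesis = c(NA, 2L, 0L, NA),
  Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L),
  Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA),
  Q1.Comm...04.Biography = c(NA, NA, NA, 1L),
  Q1.Comm...Overall.Plan = c(4L, 1L, 2L, NA)
)

checked <- df %>%
  rowwise() %>%
  mutate(
    # Assumption: item columns are the ones with a digit right after "Q1.Comm..."
    item_min = min(c_across(matches("^Q1\\.Comm\\.\\.\\.\\d")), na.rm = TRUE),
    item_max = max(c_across(matches("^Q1\\.Comm\\.\\.\\.\\d")), na.rm = TRUE),
    Q1_check = case_when(
      Q1.Comm...Overall.Plan > item_max ~ "above",
      Q1.Comm...Overall.Plan < item_min ~ "below",
      is.na(Q1.Comm...Overall.Plan) ~ NA_character_,
      TRUE ~ "within"
    )
  ) %>%
  ungroup() %>%
  select(-item_min, -item_max)  # drop the helper columns again

print(checked)
```

Selecting the item columns by pattern rather than with -c(ID, ...) keeps the helper columns created inside the same mutate out of the min/max calculation.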

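library(purrr)
library(data.table)

# Every column except the ID and the overall column is treated as an item column
needed_cols <- setdiff(names(df), c("ID", "Q1.Comm...Overall.Plan"))

# pmap() returns one c(min, max) per row; data.table::transpose() turns that
# list into one vector of minima and one vector of maxima
setDT(df)[, c("min", "max") := data.table::transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = needed_cols]
# Classify the overall column against the row-wise min and max
df[, Q1_check := fcase(
    is.na(`Q1.Comm...Overall.Plan`), NA_character_,
    `Q1.Comm...Overall.Plan` < min, "below",
    `Q1.Comm...Overall.Plan` > max, "above",
    default = "within"
  )
]
# Drop the helper columns
df[, c("max", "min") := NULL]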
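For comparison, the same row-wise check can be sketched in base R with pmin() and pmax(), which compute the per-row minimum and maximum across several columns at once. This is a different technique from the data.table code above, shown under the assumption that every column other than ID and the overall column is an item column; the names item_cols, lo and hi are illustrative.

```r
# Hand-typed copy of the question's dput (the empty X columns are left out)
df <- data.frame(
  ID = c(10L, 31L, 225L, 243L),
  Q1.Comm...01.Scope.Thesis = c(NA, 2L, 0L, NA),
  Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L),
  Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA),
  Q1.Comm...04.Biography = c(NA, NA, NA, 1L),
  Q1.Comm...Overall.Plan = c(4L, 1L, 2L, NA)
)

# Assumption: everything except the ID and the overall column is an item column
item_cols <- setdiff(names(df), c("ID", "Q1.Comm...Overall.Plan"))

# Row-wise min and max over the item columns, ignoring NA
lo <- do.call(pmin, c(df[item_cols], na.rm = TRUE))
hi <- do.call(pmax, c(df[item_cols], na.rm = TRUE))

overall <- df[["Q1.Comm...Overall.Plan"]]

# An NA overall value propagates through ifelse() and stays NA
df$Q1_check <- ifelse(overall < lo, "below",
               ifelse(overall > hi, "above", "within"))

print(df[c("ID", "Q1_check")])
```

Unlike min() and max(), pmin() and pmax() return NA rather than Inf or -Inf for a row whose item columns are all NA, so such a row would also end up with an NA check value.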

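I have modified your data to meet the requirement you discussed in the linked question; I think this will help you. I used janitor::clean_names(), and I suggest you use it before proceeding so that your column names are cleaned up.

So the modified dput is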

df <- structure(list(id = c(10L, 31L, 225L, 243L), q1_comm_01_scope_thesis = c(NA, 
2L, 0L, NA), q1_comm_02_scope_project = c(NA, NA, NA, 2L), q1_comm_03_learn_intern = c(4L, 
NA, NA, NA), q1_comm_04_biography = c(NA, NA, NA, 1L), q1_comm_overall_plan = c(4L, 
1L, 2L, NA), q2_comm_01_scope_thesis = c(NA, 4, 0, NA), q2_comm_02_scope_project = c(NA, 
NA, NA, 4), q2_comm_03_learn_intern = c(8, NA, NA, NA), q2_comm_04_biography = c(NA, 
NA, NA, 2), q2_comm_overall_plan = c(8, 2, 4, NA)), row.names = c(NA, 
-4L), class = "data.frame")

df
   id q1_comm_01_scope_thesis q1_comm_02_scope_project q1_comm_03_learn_intern q1_comm_04_biography q1_comm_overall_plan q2_comm_01_scope_thesis
1  10                      NA                       NA                       4                   NA                    4                      NA
2  31                       2                       NA                      NA                   NA                    1                       4
3 225                       0                       NA                      NA                   NA                    2                       0
4 243                      NA                        2                      NA                    1                   NA                      NA
  q2_comm_02_scope_project q2_comm_03_learn_intern q2_comm_04_biography q2_comm_overall_plan
1                       NA                       8                   NA                    8
2                       NA                      NA                   NA                    2
3                       NA                      NA                   NA                    4
4                        4                      NA                    2                   NA

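Now proceed as suggested. You will have to modify the [-5] inside cur_data() to suit your requirements (it depends on the relative position of the overall column; I think it is 9 in your case).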

library(tidyverse)

split.default(df[-1], gsub('(q\\d*)(.*)', '\\1', names(df[-1]), perl = TRUE)) %>%
  map(., ~ .x %>% bind_cols('id' = df$id) %>%
        group_by(id) %>%
        mutate(across(ends_with('_overall_plan'), ~ case_when(. < min(cur_data()[-5], na.rm = T) ~ 'below',
                                                              . > max(cur_data()[-5], na.rm = T) ~ 'above',
                                                              is.na(.) ~ NA_character_,
                                                              TRUE ~ 'within'),
                      .names = '{str_remove(.col,"_comm_overall_plan")}_check'))
        ) %>%
  reduce(left_join, by = 'id')

# A tibble: 4 x 13
# Groups:   id [4]
  q1_comm_01_scop~ q1_comm_02_scop~ q1_comm_03_lear~ q1_comm_04_biog~ q1_comm_overall~    id q1_check q2_comm_01_scop~ q2_comm_02_scop~ q2_comm_03_lear~ q2_comm_04_biog~
             <int>            <int>            <int>            <int>            <int> <int> <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1               NA               NA                4               NA                4    10 within                 NA               NA                8               NA
2                2               NA               NA               NA                1    31 below                   4               NA               NA               NA
3                0               NA               NA               NA                2   225 above                   0               NA               NA               NA
4               NA                2               NA                1               NA   243 NA                     NA                4               NA                2
# ... with 2 more variables: q2_comm_overall_plan <dbl>, q2_check <chr>
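
The [-5] index in the code above is the position of the overall column inside each quarter's group of columns. Here is a minimal sketch, on a cut-down hypothetical data frame, of how split.default() forms those groups and how that position could be looked up instead of hard-coded. The names prefix, groups and overall_pos are illustrative only.

```r
# Hypothetical cut-down data: one item column and one overall column per quarter
df <- data.frame(
  id = c(10L, 31L),
  q1_comm_01_scope_thesis = c(NA, 2L),
  q1_comm_overall_plan = c(4L, 1L),
  q2_comm_01_scope_thesis = c(NA, 4),
  q2_comm_overall_plan = c(8, 2)
)

# Reduce each column name to its quarter prefix ("q1", "q2", ...)
prefix <- gsub("(q\\d*)(.*)", "\\1", names(df[-1]), perl = TRUE)

# split.default() treats the data frame as a list of columns and
# returns one data frame per prefix
groups <- split.default(df[-1], prefix)

# Position of the overall column within each group
overall_pos <- sapply(groups, function(g) which(endsWith(names(g), "_overall_plan")))

print(names(groups))
print(overall_pos)
```

With the full 13-column layout per quarter, overall_pos would give the index to use in place of 5, provided the overall column name still ends in _overall_plan.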