How to find the highest value in a row which is not a distinct variable

I have this data frame:

mydf <- structure(list(POS = c("1", "2", "3", "4"), A = c("10", "10", 
"6", "1"), C = c("1", "8", "2", "7"), T = c("6", "2", "10", "8"
), G = c("0", "0", "2", "11"), Ref = c("A", "A", "T", "C")), class = "data.frame", row.names = c(NA, 
-4L))

It looks like this:

   POS  A    C   T    G    Ref
    1   10   1   6    0     A
    2   10   8   2    0     A
    3   6    2   10   2     T
    4   1    7   8    11    C

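My goal is to extract the highest value in each row, excluding the value in the column named in Ref. For the first row this means I want the value of T, because it is the highest value that does not belong to the Ref column A. In the second row I want the value of C, and so on.

The POS column does not count here; only A, T, G and C are considered.

Unfortunately I have to do this over quite a lot of rows, so I need an automated solution.

I would be happy with a dplyr solution, since I am trying to focus on dplyr :)

Thanks a lot!

Thanks a lot for all the answers. There are multiple correct solutions; I simply accepted the one I am currently using. The other answers work too!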

You can try max inside apply:

apply(sapply(c("A", "C", "T", "G"), function(i)
   `[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
#[1]  6  8  6 11
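
The one-liner above packs several steps together. Here is a minimal unpacked sketch of the same idea, assuming the data frame from the question; the names cols, masked and res are mine and only for illustration:

```r
mydf <- data.frame(
  POS = c("1", "2", "3", "4"),
  A   = c("10", "10", "6", "1"),
  C   = c("1", "8", "2", "7"),
  T   = c("6", "2", "10", "8"),
  G   = c("0", "0", "2", "11"),
  Ref = c("A", "A", "T", "C"),
  stringsAsFactors = FALSE
)

cols <- c("A", "C", "T", "G")

# Build a numeric matrix, one column per base, with the Ref entry set to NA
masked <- sapply(cols, function(i) {
  v <- as.numeric(mydf[[i]])  # the counts are stored as character
  v[mydf$Ref == i] <- NA      # hide the value where this column is the reference
  v
})

# Row-wise maximum over the values that are left
res <- apply(masked, 1, max, na.rm = TRUE)
res
#[1]  6  8  6 11
```

The backtick call `[<-`(v, idx, NA) in the original does the same as the two middle lines of the function here, just without a temporary variable.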

Or using pmax:

do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
    `[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
#[1]  6  8  6 11
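
In case do.call is unfamiliar: it calls pmax with the list elements passed as separate arguments. A small sketch on made-up vectors (a, b and cc are hypothetical and not taken from the question):

```r
a  <- c(NA, 5, 3)
b  <- c(2, NA, 9)
cc <- c(7, 1, NA)

# Calling pmax directly and calling it through do.call are equivalent
direct <- pmax(a, b, cc, na.rm = TRUE)
via_do <- do.call(pmax, c(list(a, b, cc), na.rm = TRUE))

direct
#[1] 7 5 9
identical(direct, via_do)
#[1] TRUE
```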

Benchmark:

library(dplyr)
bench::mark(check = FALSE
 , apply = apply(sapply(c("A", "C", "T", "G"), function(i)
   `[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
 , do.call = do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
   `[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
 , mapply = mapply(function(x, i) max(as.numeric(unlist(x))[-i]), 
       x = split(mydf[, 2:5], seq(nrow(mydf))), 
       i = match(mydf$Ref, names(mydf)[-1]))
 , sapply = sapply(split(mydf, seq(nrow(mydf))), 
                   function(x) max(as.numeric(x[, setdiff(c("A", "C", "T", "G"), x$Ref)])))
 , dplyr = {mydf %>%
    rowwise() %>%
     mutate(Res = Reduce(pmax, across(A:G, ~ as.numeric(.) * (. != get(Ref)))))}
   )
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 apply       103.7µs 111.06µs     8861.    4.13KB     14.5  4291     7
#2 do.call      63.3µs  68.56µs    14072.    4.13KB     14.4  6825     7
#3 mapply      323.3µs 355.44µs     2747.   14.55KB     12.4  1329     6
#4 sapply      469.4µs 516.12µs     1855.    16.5KB     12.5   892     6
#5 dplyr         7.6ms   8.26ms      120.   23.35KB     11.1    54     5

Using pmax with do.call looks to be the fastest here, and together with the apply variant it allocates the least memory.
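
Note that the benchmark above runs on only four rows, so the ranking may not carry over to larger inputs. Here is a rough sketch for checking this with base R only. The replication factor and the names big, f_apply and f_pmax are arbitrary choices of mine, and no timings are claimed:

```r
mydf <- data.frame(
  POS = c("1", "2", "3", "4"),
  A   = c("10", "10", "6", "1"),
  C   = c("1", "8", "2", "7"),
  T   = c("6", "2", "10", "8"),
  G   = c("0", "0", "2", "11"),
  Ref = c("A", "A", "T", "C"),
  stringsAsFactors = FALSE
)
cols <- c("A", "C", "T", "G")

# Replicate the four rows to get a larger test input (10000 rows)
big <- mydf[rep(seq_len(nrow(mydf)), times = 2500), ]

f_apply <- function(d) apply(sapply(cols, function(i)
  `[<-`(as.numeric(d[[i]]), d$Ref == i, NA)), 1, max, na.rm = TRUE)

f_pmax <- function(d) do.call(pmax, c(lapply(cols, function(i)
  `[<-`(as.numeric(d[[i]]), d$Ref == i, NA)), na.rm = TRUE))

# Make sure both approaches agree before timing them
stopifnot(isTRUE(all.equal(unname(f_apply(big)), f_pmax(big))))

system.time(f_apply(big))
system.time(f_pmax(big))
```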

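A dplyr option could be:

library(dplyr)

# assumes A, C, T and G are already numeric,
# e.g. after mydf <- type.convert(mydf, as.is = TRUE)
mydf %>%
    rowwise() %>%
    mutate(Res = Reduce(pmax, across(A:G, ~ . * (. != get(Ref)))))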

    POS     A     C     T     G Ref     Res
  <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1     1    10     1     6     0 A         6
2     2    10     8     2     0 A         8
3     3     6     2    10     2 T         6
4     4     1     7     8    11 C        11
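
As far as I can tell, the version above compares values rather than column names, so a column that happens to tie with the Ref column's value would be zeroed out as well. Here is a sketch of a variant that masks by column name instead. It assumes dplyr 1.0.0 or later (for across and c_across), and the names cols and out are mine:

```r
library(dplyr)

mydf <- data.frame(
  POS = c("1", "2", "3", "4"),
  A   = c("10", "10", "6", "1"),
  C   = c("1", "8", "2", "7"),
  T   = c("6", "2", "10", "8"),
  G   = c("0", "0", "2", "11"),
  Ref = c("A", "A", "T", "C"),
  stringsAsFactors = FALSE
)
cols <- c("A", "C", "T", "G")

out <- mydf %>%
  mutate(across(all_of(cols), as.numeric)) %>%
  rowwise() %>%
  # keep only the values whose column name differs from this row's Ref
  mutate(Res = max(c_across(all_of(cols))[cols != Ref])) %>%
  ungroup()

out$Res
#[1]  6  8  6 11
```

Like the original, this stays row-wise and will likely be slower than the vectorised base R options on large inputs.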

Base R solutions are:

sapply(split(mydf, seq(nrow(mydf))), 
       function(x) max(x[, setdiff(c("A", "C", "T", "G"), x$Ref)]))
#R>  1  2  3  4 
#R>  6  8  6 11

mapply(function(x, i) max(x[-i]), 
       x = split(as.matrix(mydf[, 2:5]), seq(nrow(mydf))), 
       i = match(mydf$Ref, names(mydf)[-1]))
#R>  1  2  3  4 
#R>  6  8  6 11 

or like this:

x <- as.matrix(mydf[, c("A", "C", "T", "G")])
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
apply(x, 1, max, na.rm = TRUE)
#R> [1]  6  8  6 11

# in R 4.1.0 or greater
as.matrix(mydf[, c("A", "C", "T", "G")]) |>
  (\(x){ 
   x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
   x
  })() |>
  apply(1, max, na.rm = TRUE)
#R> [1]  6  8  6 11

For all of these, I first converted the columns to numeric as follows, since I assume that is what you want:

mydf[, c("A", "C", "T", "G")] <- 
  lapply(mydf[, c("A", "C", "T", "G")], as.numeric)

You can set the value in the column named by Ref to NA in each row, then use pmax to get the row-wise maximum while ignoring the NA values.

mydf <- type.convert(mydf, as.is = TRUE)
tmp <- mydf
tmp[cbind(1:nrow(tmp), match(tmp$Ref, names(tmp)))] <- NA
mydf$max_value <- do.call(pmax, c(tmp[2:5], na.rm = TRUE))
mydf

#  POS  A C  T  G Ref max_value
#1   1 10 1  6  0   A         6
#2   2 10 8  2  0   A         8
#3   3  6 2 10  2   T         6
#4   4  1 7  8 11   C        11
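
The key step above is indexing with a two-column matrix: each row of cbind(row, col) addresses exactly one cell. Here is a minimal sketch that wraps the idea in a reusable function. max_excluding_ref and its arguments are hypothetical names of mine, and the sketch assumes every Ref value appears in cols:

```r
max_excluding_ref <- function(d, cols, ref = "Ref") {
  m <- as.matrix(d[cols])        # character matrix if the columns are character
  storage.mode(m) <- "double"    # convert to numeric while keeping the dimensions
  # one (row, column) pair per row: the cell that belongs to the Ref column
  m[cbind(seq_len(nrow(m)), match(d[[ref]], cols))] <- NA
  do.call(pmax, c(as.data.frame(m), na.rm = TRUE))
}

mydf <- data.frame(
  POS = c("1", "2", "3", "4"),
  A   = c("10", "10", "6", "1"),
  C   = c("1", "8", "2", "7"),
  T   = c("6", "2", "10", "8"),
  G   = c("0", "0", "2", "11"),
  Ref = c("A", "A", "T", "C"),
  stringsAsFactors = FALSE
)

max_excluding_ref(mydf, c("A", "C", "T", "G"))
#[1]  6  8  6 11
```

Working on a copy of the numeric columns leaves the original data frame untouched, which is the same reason the answer above uses tmp.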