How to find the highest value in a row which is not a distinct variable
I have this data frame:
mydf <- structure(list(POS = c("1", "2", "3", "4"), A = c("10", "10",
"6", "1"), C = c("1", "8", "2", "7"), T = c("6", "2", "10", "8"
), G = c("0", "0", "2", "11"), Ref = c("A", "A", "T", "C")), class = "data.frame", row.names = c(NA,
-4L))
which looks like this:
POS A C T G Ref
1 10 1 6 0 A
2 10 8 2 0 A
3 6 2 10 2 T
4 1 7 8 11 C
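My goal is to extract, for each row, the highest value that is not in the column named in Ref. For the first row this means I want the value of T, because it is the highest value apart from Ref A. In the second row I want the value of C, and so on.
The POS column does not count here; only A, T, G and C do.
Unfortunately I have to do this on quite a lot of rows, so I need an automated solution.
I would be happy with a dplyr solution, since I am trying to focus on dplyr :)
Thanks a lot!
Thanks a lot for all the answers provided. There are multiple correct solutions; I simply accepted one that I am currently using. The other answers work as well!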
You can try max in apply:
apply(sapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
#[1] 6 8 6 11
Or using pmax:
do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
#[1] 6 8 6 11
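For readers unfamiliar with the backtick form: `[<-`(x, i, NA) is the function-call spelling of x[i] <- NA, and it returns the modified vector. Below is a more verbose sketch of the same masking idea. The names bases and masked are my own and not part of the answer above, and the data frame is re-created only so the snippet runs on its own.

```r
# Example data from the question (all columns are character).
mydf <- data.frame(
  POS = c("1", "2", "3", "4"),
  A   = c("10", "10", "6", "1"),
  C   = c("1", "8", "2", "7"),
  T   = c("6", "2", "10", "8"),
  G   = c("0", "0", "2", "11"),
  Ref = c("A", "A", "T", "C"),
  stringsAsFactors = FALSE
)

bases <- c("A", "C", "T", "G")  # hypothetical helper name

# For each base column: convert to numeric, then blank out the rows
# where that column is the reference column.
masked <- lapply(bases, function(b) {
  v <- as.numeric(mydf[[b]])
  v[mydf$Ref == b] <- NA
  v
})

# pmax() takes the element-wise (here: row-wise) maximum across the columns.
res <- do.call(pmax, c(masked, na.rm = TRUE))
res
#[1]  6  8  6 11
```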
Benchmark:
library(dplyr)
bench::mark(check = FALSE
, apply = apply(sapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), 1, max, na.rm=TRUE)
, do.call = do.call(pmax, c(lapply(c("A", "C", "T", "G"), function(i)
`[<-`(as.numeric(mydf[[i]]), mydf$Ref == i, NA)), na.rm=TRUE))
, mapply = mapply(function(x, i) max(as.numeric(unlist(x))[-i]),
x = split(mydf[, 2:5], seq(nrow(mydf))),
i = match(mydf$Ref, names(mydf)[-1]))
, sapply = sapply(split(mydf, seq(nrow(mydf))),
function(x) max(as.numeric(x[, setdiff(c("A", "C", "T", "G"), x$Ref)])))
, dplyr = {mydf %>%
rowwise() %>%
mutate(Res = Reduce(pmax, across(A:G, ~ as.numeric(.) * (. != get(Ref)))))}
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 apply 103.7µs 111.06µs 8861. 4.13KB 14.5 4291 7
#2 do.call 63.3µs 68.56µs 14072. 4.13KB 14.4 6825 7
#3 mapply 323.3µs 355.44µs 2747. 14.55KB 12.4 1329 6
#4 sapply 469.4µs 516.12µs 1855. 16.5KB 12.5 892 6
#5 dplyr 7.6ms 8.26ms 120. 23.35KB 11.1 54 5
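Using pmax with do.call looks to be the fastest on this four-row example, and it ties with apply for the lowest memory allocation (4.13KB).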
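The timings above come from the four-row example, so they mostly measure fixed overhead. Here is a rough sketch for checking how the two fastest variants behave on more rows. The row count, the seed, the value range and the helper name mask_ref are all assumptions for illustration, and no timings are claimed here.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n     <- 10000                   # assumed row count
bases <- c("A", "C", "T", "G")

# Simulated counts and a random reference base per row.
big <- as.data.frame(matrix(sample(0:50, n * 4, replace = TRUE),
                            ncol = 4, dimnames = list(NULL, bases)))
big$Ref <- sample(bases, n, replace = TRUE)

# Set the reference column's value to NA, one base column at a time.
mask_ref <- function(b) {
  v <- big[[b]]
  v[big$Ref == b] <- NA
  v
}

t_apply <- system.time(
  res_apply <- apply(sapply(bases, mask_ref), 1, max, na.rm = TRUE)
)
t_pmax <- system.time(
  res_pmax <- do.call(pmax, c(lapply(bases, mask_ref), na.rm = TRUE))
)

# Both variants should agree before their timings are compared.
stopifnot(all(res_apply == res_pmax))
print(rbind(apply = t_apply, pmax = t_pmax)[, "elapsed", drop = FALSE])
```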
One dplyr option could be (with the columns already converted to numeric, as the <dbl> types in the output show):
mydf %>%
rowwise() %>%
mutate(Res = Reduce(pmax, across(A:G, ~ . * (. != get(Ref)))))
POS A C T G Ref Res
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 10 1 6 0 A 6
2 2 10 8 2 0 A 8
3 3 6 2 10 2 T 6
4 4 1 7 8 11 C 11
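One thing to watch with the multiply-by-zero trick: as far as I can tell, it compares values rather than column names, so a column that ties with the Ref column is zeroed as well, and a row with only negative values would return 0. Here is a small base R sketch of that arithmetic on a single made-up row. The vector x and the name ref are hypothetical and not taken from the question.

```r
# A made-up row where C ties with the reference column A.
x   <- c(A = 10, C = 10, T = 6, G = 0)
ref <- "A"

# Value-based exclusion (what `. * (. != get(Ref))` does per row):
# every cell equal to the reference value becomes 0.
by_value <- max(x * (x != x[[ref]]))
by_value
#[1] 6

# Name-based exclusion: drop only the reference column itself.
by_name <- max(x[names(x) != ref])
by_name
#[1] 10
```

On the example data in the question there are no such ties, so both approaches give the same result there.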
A base R solution is
sapply(split(mydf, seq(nrow(mydf))),
function(x) max(x[, setdiff(c("A", "C", "T", "G"), x$Ref)]))
#R> 1 2 3 4
#R> 6 8 6 11
or
mapply(function(x, i) max(x[-i]),
x = split(as.matrix(mydf[, 2:5]), seq(nrow(mydf))),
i = match(mydf$Ref, names(mydf)[-1]))
#R> 1 2 3 4
#R> 6 8 6 11
or something like
x <- as.matrix(mydf[, c("A", "C", "T", "G")])
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
apply(x, 1, max, na.rm = TRUE)
#R> [1] 6 8 6 11
# in R 4.1.0 or greater
as.matrix(mydf[, c("A", "C", "T", "G")]) |>
(\(x){
x[rep(c("A", "C", "T", "G"), each = NROW(mydf)) == mydf$Ref] <- NA_real_
x
})() |>
apply(1, max, na.rm = TRUE)
#R> [1] 6 8 6 11
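The comparison rep(..., each = NROW(mydf)) == mydf$Ref works because a matrix is stored column by column: the left-hand side spells out the column name of every cell in that order, and mydf$Ref is recycled down each column. Here is a sketch that builds the mask explicitly and checks it against outer(), with ref and bases as stand-in names of my own.

```r
bases <- c("A", "C", "T", "G")
ref   <- c("A", "A", "T", "C")   # the Ref column from the question

# Column name of every cell in column-major order, compared with the
# (recycled) reference base of the corresponding row.
mask <- matrix(rep(bases, each = length(ref)) == ref,
               nrow = length(ref), dimnames = list(NULL, bases))
mask
#          A     C     T     G
#[1,]  TRUE FALSE FALSE FALSE
#[2,]  TRUE FALSE FALSE FALSE
#[3,] FALSE FALSE  TRUE FALSE
#[4,] FALSE  TRUE FALSE FALSE

# The same mask written as an outer comparison of rows against columns.
mask_outer <- outer(ref, bases, "==")
stopifnot(all(mask == mask_outer))
```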
I first converted the columns to numeric variables as follows, since I assume that is what you want:
mydf[, c("A", "C", "T", "G")] <-
lapply(mydf[, c("A", "C", "T", "G")], as.numeric)
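You can set the value in the column named by Ref to NA in each row and then use pmax with na.rm = TRUE to get the row-wise maximum, ignoring the NA values.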
mydf <- type.convert(mydf, as.is = TRUE)
tmp <- mydf
tmp[cbind(1:nrow(tmp), match(tmp$Ref, names(tmp)))] <- NA
mydf$max_value <- do.call(pmax, c(tmp[2:5], na.rm = TRUE))
mydf
# POS A C T G Ref max_value
#1 1 10 1 6 0 A 6
#2 2 10 8 2 0 A 8
#3 3 6 2 10 2 T 6
#4 4 1 7 8 11 C 11
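The key step above is indexing with a two-column matrix: each row of cbind(row, col) addresses exactly one cell. Here is a small sketch on a plain numeric matrix, which also reports which base holds the remaining maximum. That extra step is my own addition, and m, idx, best_val and best_col are illustrative names.

```r
bases <- c("A", "C", "T", "G")
m <- matrix(c(10, 10,  6,  1,    # A
               1,  8,  2,  7,    # C
               6,  2, 10,  8,    # T
               0,  0,  2, 11),   # G
            nrow = 4, dimnames = list(NULL, bases))
ref <- c("A", "A", "T", "C")

# One (row, column) pair per row: the cell of the reference base.
idx <- cbind(seq_len(nrow(m)), match(ref, bases))
ref_vals <- m[idx]
ref_vals
#[1] 10 10 10  7

# Blank out those cells, then take the row-wise maximum of what is left.
m[idx] <- NA
best_val <- apply(m, 1, max, na.rm = TRUE)
best_val
#[1]  6  8  6 11

# which.max() skips NA, so it gives the column of the remaining maximum
# (the first one in case of ties).
best_col <- bases[apply(m, 1, which.max)]
best_col
#[1] "T" "C" "A" "G"
```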