I'm trying to make a function to calculate percent difference for all pair combinations within a group in a vector
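So I need to calculate the percent difference for all combinations of values in the y column. For example, the difference between B 1 and B 2, then between B 1 and B 3, and so on for all combinations of B. Then the same for all combinations of D.
Here is some sample data...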
structure(list(Levelname = c("B 1", "B 2", "B 3",
"B 4", "D 1", "D 2", "D 3", "D 4"), y = c(0.679428655093332,
1.07554328679719, 0.883000346050764, 0.791772867506205, 0.538143790501689,
0.805122127560562, 0.591353204313314, 0.795225886492002), fill = c("midnightblue",
"dodgerblue4", "steelblue3", "lightskyblue", "midnightblue",
"dodgerblue4", "steelblue3", "lightskyblue"), species = c("White Grunt",
"White Grunt", "White Grunt", "White Grunt", "White Grunt", "White Grunt",
"White Grunt", "White Grunt")), row.names = c(NA, -8L), class = "data.frame")
My ideal output would be a data frame with some kind of identifier, for example
Pair Percent Difference
B 1 - B 2 45.142
B 1 - B 3 .....
B 1 - B 4 .....
B 2 - B 3 .....
B 2 - B 4 .....
B 3 - B 4 .....
D 1 - D 2 .....
D 1 - D 3 .....
D 1 - D 4 .....
D 2 - D 3 .....
D 2 - D 4 .....
D 3 - D 4 .....
where ..... are the percent differences
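I don't care about the differences between B and D. I'm also trying to get better at functions, for loops, and R's apply functions, so it would be great if the answer could use those, or any other approach.
I tried looking at these answers, but I couldn't figure it out...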
Loops in R - Need to use index, anyway to avoid 'for'?
I calculated the 45.142 above with this:
|B1−B2|/[(B1+B2)/2]×100=?
=|0.67942865509333−1.0755433|/[(0.67942865509333+1.0755433)/2]×100
=|−0.39611464490667|/[1.7549719550933/2]×100
=0.39611464490667/0.87748597754667×100
=0.45142×100
=45.142% difference
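As a quick check of the arithmetic above, the same formula can be written as a small R function. This is only a sketch: the name pct_diff is made up for illustration, and the two inputs are the y values for B 1 and B 2 from the sample data.

```r
# Percent difference between two values, relative to their mean
# (pct_diff is a hypothetical name, not from the original post)
pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2) * 100

b1 <- 0.679428655093332  # y for "B 1"
b2 <- 1.07554328679719   # y for "B 2"

round(pct_diff(b1, b2), 3)
# [1] 45.142
```

Because the function only uses vectorised arithmetic, it also accepts whole vectors of values for a and b.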
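We can use outer, applying the formula to the y values and paste to the Levelname values, and in each case keep only the upper.tri. Here dt is the sample data frame from the question.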
f <- \(x, y) abs(x - y)*100 / ((x + y) / 2) ## your p_diff formula
p_diff <- outer(dt$y, dt$y, f) |>
{\(x) abs(x[upper.tri(x)])}() |>
round(3)
Pair <- outer(dt$Levelname, dt$Levelname, paste, sep=' - ')|>
{\(x) x[upper.tri(x)]}()
res <- data.frame(Pair, p_diff)
Result
res
# Pair p_diff
# 1 B 1 - B 2 45.142
# 2 B 1 - B 3 26.058
# 3 B 2 - B 3 19.662
# 4 B 1 - B 4 15.272
# 5 B 2 - B 4 30.393
# 6 B 3 - B 4 10.894
# 7 B 1 - D 1 23.208
# 8 B 2 - D 1 66.605
# 9 B 3 - D 1 48.532
# 10 B 4 - D 1 38.142
# 11 B 1 - D 2 16.934
# 12 B 2 - D 2 28.758
# 13 B 3 - D 2 9.227
# 14 B 4 - D 2 1.672
# 15 D 1 - D 2 39.751
# 16 B 1 - D 3 13.862
# 17 B 2 - D 3 58.095
# 18 B 3 - D 3 39.563
# 19 B 4 - D 3 28.981
# 20 D 1 - D 3 9.422
# 21 D 2 - D 3 30.615
# 22 B 1 - D 4 15.705
# 23 B 2 - D 4 29.968
# 24 B 3 - D 4 10.460
# 25 B 4 - D 4 0.435
# 26 D 1 - D 4 38.561
# 27 D 2 - D 4 1.237
# 28 D 3 - D 4 29.407
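The result above also contains cross-group pairs such as B 1 - D 1, which the question says are not needed. One way to restrict it to pairs within a group, sketched in base R with split, lapply and combn: the helper name pairs_within is hypothetical, and the code assumes the group is the text before the first space in Levelname and that every group has at least two rows.

```r
# Sample data from the question (only the columns used here)
dt <- data.frame(
  Levelname = c("B 1", "B 2", "B 3", "B 4", "D 1", "D 2", "D 3", "D 4"),
  y = c(0.679428655093332, 1.07554328679719, 0.883000346050764,
        0.791772867506205, 0.538143790501689, 0.805122127560562,
        0.591353204313314, 0.795225886492002)
)

# Assumption: the group is everything before the first space ("B", "D")
grp <- sub(" .*", "", dt$Levelname)

# Hypothetical helper: all pairwise percent differences inside one group
pairs_within <- function(d) {
  idx <- combn(nrow(d), 2)  # 2 x n_pairs matrix of row indices
  a <- d$y[idx[1, ]]
  b <- d$y[idx[2, ]]
  data.frame(
    Pair   = paste(d$Levelname[idx[1, ]], d$Levelname[idx[2, ]], sep = " - "),
    p_diff = round(abs(a - b) * 100 / ((a + b) / 2), 3)
  )
}

# Apply the helper to each group and stack the pieces
res_grp <- do.call(rbind, lapply(split(dt, grp), pairs_within))
rownames(res_grp) <- NULL
res_grp
```

This gives 12 rows (6 pairs for B, 6 for D) in the order the question asks for, and it never builds the full 8 x 8 matrix of cross-group pairs.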
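Benchmark
I doubted that the tidy approach would be faster, and I was right. Here is a benchmark comparing the solutions so far. The outer approach is almost 20 times faster.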
f1 <- \() data.frame(p_diff=outer(dt$y, dt$y, f) |>
{\(x) abs(x[upper.tri(x)])}() |>
round(3),
Pair=outer(dt$Levelname, dt$Levelname, paste, sep=' - ')|>
{\(x) x[upper.tri(x)]}())
library(dplyr);library(stringr)
f2 <- \() dt %>%
  group_by(grp = str_extract(Levelname, "\\w+")) %>%  ## "\\w+": backslash must be escaped in an R string
  summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
            perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
            .groups = 'drop')
dt <- dt[sample(nrow(dt), 1e3, replace=T), ]
microbenchmark::microbenchmark(outer=f1(), tidyverse=f2(), times=3L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# outer 236.7207 243.0496 306.1265 249.3785 340.8294 432.2804 3 a
# tidyverse 4819.3476 4830.7364 4838.5051 4842.1251 4848.0839 4854.0427 3 b
Using tidyverse:
library(tidyverse)
df %>%
  group_by(grp = str_extract(Levelname, "\\w+")) %>%  ## "\\w+": backslash must be escaped in an R string
  summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
            perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
            .groups = 'drop')
# A tibble: 12 x 3
grp pair perc_diff
<chr> <chr> <dbl>
1 B B 1 - B 2 45.1
2 B B 1 - B 3 26.1
3 B B 1 - B 4 15.3
4 B B 2 - B 3 19.7
5 B B 2 - B 4 30.4
6 B B 3 - B 4 10.9
7 D D 1 - D 2 39.8
8 D D 1 - D 3 9.42
9 D D 1 - D 4 38.6
10 D D 2 - D 3 30.6
11 D D 2 - D 4 1.24
12 D D 3 - D 4 29.4