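How to generate a list-column holding named vectors, when grouping by other data frame variables?
I have a data frame, and I want to generate a new list-column that holds named vectors (one vector per row). The names and values of each vector come from two other columns of the data frame. I'm stuck because I want to do this:
- by group
- as computationally efficiently as possible
Example
Let's use the mpg dataset from {ggplot2} to illustrate the by-group principle. I want to pair up the cty and hwy values, grouped by the distinct combinations of manufacturer and year. So we can do: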
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

my_mpg <-
  mpg %>%
  select(manufacturer, year, cty, hwy)

via_tidyr_nest <-
  my_mpg %>%
  group_by(manufacturer, year) %>%
  nest()

via_tidyr_nest
#> # A tibble: 30 x 3
#> # Groups: manufacturer, year [30]
#> manufacturer year data
#> <chr> <int> <list>
#> 1 audi 1999 <tibble [9 x 2]>
#> 2 audi 2008 <tibble [9 x 2]>
#> 3 chevrolet 2008 <tibble [12 x 2]>
#> 4 chevrolet 1999 <tibble [7 x 2]>
#> 5 dodge 1999 <tibble [16 x 2]>
#> 6 dodge 2008 <tibble [21 x 2]>
#> 7 ford 1999 <tibble [15 x 2]>
#> 8 ford 2008 <tibble [10 x 2]>
#> 9 honda 1999 <tibble [5 x 2]>
#> 10 honda 2008 <tibble [4 x 2]>
#> # ... with 20 more rows
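Created on 2021-09-27 by the reprex package (v0.3.0)
This is almost perfect, except that I don't want nested tibbles but nested named vectors. (The reason: once the output is stored as an object in the environment, the named-vector version is lighter than the nested-tibble version.)
A working but undesired solution would be to take via_tidyr_nest and convert the nested tibbles to named vectors.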
expected_output <-
  via_tidyr_nest %>%
  mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
  select(-data)

expected_output
#> # A tibble: 30 x 3
#> # Groups: manufacturer, year [30]
#> manufacturer year desired_named_vec
#> <chr> <int> <list>
#> 1 audi 1999 <int [9]>
#> 2 audi 2008 <int [9]>
#> 3 chevrolet 2008 <int [12]>
#> 4 chevrolet 1999 <int [7]>
#> 5 dodge 1999 <int [16]>
#> 6 dodge 2008 <int [21]>
#> 7 ford 1999 <int [15]>
#> 8 ford 2008 <int [10]>
#> 9 honda 1999 <int [5]>
#> 10 honda 2008 <int [4]>
#> # ... with 20 more rows
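This is undesired because it reaches the expected output in a roundabout way: it first creates a tibble and then converts it to a named vector. While the processing time is negligible in this example, in reality I have a large dataset (10 million rows), so any extra step is costly. Instead, I'd like to get to expected_output in as few steps as possible.
An unsuccessful attempt: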
library(purrr)

via_summarise_map2_setnames <-
  my_mpg %>%
  group_by(manufacturer, year) %>%
  summarise(named_vec = map2(.x = cty, .y = hwy, .f = ~setNames(.x, .y)))
#> `summarise()` has grouped output by 'manufacturer', 'year'. You can override using the `.groups` argument.
via_summarise_map2_setnames
#> # A tibble: 234 x 3
#> # Groups: manufacturer, year [30]
#> manufacturer year named_vec
#> <chr> <int> <list>
#> 1 audi 1999 <int [1]>
#> 2 audi 1999 <int [1]>
#> 3 audi 1999 <int [1]>
#> 4 audi 1999 <int [1]>
#> 5 audi 1999 <int [1]>
#> 6 audi 1999 <int [1]>
#> 7 audi 1999 <int [1]>
#> 8 audi 1999 <int [1]>
#> 9 audi 1999 <int [1]>
#> 10 audi 2008 <int [1]>
#> # ... with 224 more rows
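Any idea how to go directly from my_mpg to expected_output, without creating tibbles in between?
EDIT
Just a general thought in the context of this question. I know that the default behavior of tidyr::nest() is to return nested tibbles, but I haven't found any discussion of that decision. In other words, what if we want to choose the class of the nested data ourselves? It could be the default tibble, but also a data.frame, a data.table, a named vector, and so on: whatever the user chooses as the output class.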
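Edited: replaced 'map' with 'Map'.
Hope this works for you.
Your solution is in 'f', my proposal is in 'g'.
It uses the indices created by dplyr's 'group_by' to collect the data needed to build the named vectors.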
f <- function () {
  via_tidyr_nest <-
    my_mpg %>%
    group_by(manufacturer, year) %>%
    nest()
  expected_output <-
    via_tidyr_nest %>%
    mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
    select(-data)
}
g <- function () {
  df1 <- my_mpg %>% group_by(manufacturer, year)
  df2 <- attr(df1, "groups")
  Map(function(rows) {
    r <- df1[rows, "cty", drop = TRUE]
    setNames(r, df1[rows, "hwy", drop = TRUE])
  },
  df2$.rows
  ) -> l
  df <- data.frame(manufacturer = df2$manufacturer, year = df2$year, named_vector = I(l))
}
# other solutions
h <- function () {
  hdf <- my_mpg %>%
    group_by(manufacturer, year) %>%
    summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)))
}
k <- function() {
  mpg |>
    select(manufacturer, year, cty, hwy) |>
    group_by(manufacturer, year) |>
    group_modify(\(x, ...) tibble(res = list(deframe(x))))
}
library(microbenchmark)
microbenchmark(OP = f(), Nicolas2 = g(), Rui = h(), Till = k())
Unit: milliseconds
expr min lq mean median uq max neval
OP 21.8917 22.64035 24.389126 23.28235 24.70075 39.9593 100
Nicolas2 3.0507 3.15920 3.481469 3.24625 3.57840 7.3173 100
Rui 6.5460 6.75300 7.505564 7.16255 7.64390 12.0359 100
Till 31.2364 32.31115 34.940356 32.92990 36.11505 107.2709 100
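The idea behind 'g' (index the original columns by per-group row numbers instead of materializing a nested tibble per group) can be sketched in base R. This is only an illustration on a made-up toy data frame: `toy` and its values are assumptions, not the mpg data, and `split()` stands in for the `.rows` index that dplyr stores in the "groups" attribute.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  manufacturer = c("audi", "audi", "audi", "ford", "ford"),
  year         = c(1999L, 1999L, 2008L, 1999L, 1999L),
  cty          = c(18L, 21L, 20L, 11L, 14L),
  hwy          = c(29L, 30L, 31L, 17L, 19L)
)

# Row numbers of each (manufacturer, year) group;
# drop = TRUE skips combinations that never occur
rows <- split(seq_len(nrow(toy)), list(toy$manufacturer, toy$year), drop = TRUE)

# One named vector per group, built by indexing the original columns directly
named_vecs <- lapply(rows, function(r) setNames(toy$cty[r], toy$hwy[r]))

named_vecs[["audi.1999"]]
```

Each element of `named_vecs` holds the cty values of one group, named by the matching hwy values, without any intermediate per-group data frame.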
Here is a way. Coerce cty and hwy to "list" before setting the names. It seems to work.
library(purrr)
library(dplyr)

data(mpg, package = "ggplot2")

my_mpg <-
  mpg %>%
  select(manufacturer, year, cty, hwy)

my_mpg %>%
  group_by(manufacturer, year) %>%
  summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)))
#`summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
## A tibble: 30 x 3
## Groups: manufacturer [15]
# manufacturer year named_vec
# <chr> <int> <list>
# 1 audi 1999 <int [9]>
# 2 audi 2008 <int [9]>
# 3 chevrolet 1999 <int [7]>
# 4 chevrolet 2008 <int [12]>
# 5 dodge 1999 <int [16]>
# 6 dodge 2008 <int [21]>
# 7 ford 1999 <int [15]>
# 8 ford 2008 <int [10]>
# 9 honda 1999 <int [5]>
#10 honda 2008 <int [4]>
## … with 20 more rows
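Why the list() wrapping matters: map2(), like base R's Map(), walks its inputs element by element. Given two length-n vectors it calls the function n times (hence the 234 one-element vectors in the question's attempt); given two length-one lists it calls it once per group with the whole vectors. A small sketch on made-up values, using base R's Map() as a stand-in for purrr::map2():

```r
cty <- c(18L, 21L, 20L)
hwy <- c(29L, 30L, 31L)

# Element-wise: three calls, each producing a length-1 named vector
per_element <- Map(setNames, cty, hwy)

# Wrapped in list(): a single call that receives the full vectors
per_group <- Map(setNames, list(cty), list(hwy))

length(per_element)  # 3
length(per_group)    # 1
per_group[[1]]
```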
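Benchmark
Since the question is about performance, here is a benchmark of the 4 solutions proposed so far: the question's, Nicolas2's, Till's, and mine above.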
library(tidyr)           # nest()
library(tibble)          # tibble(), deframe()
library(microbenchmark)

f <- function(X) {
  X %>%
    group_by(manufacturer, year) %>%
    nest() %>%
    mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
    select(-data)
}
g <- function(X) {
  df1 <- X %>% group_by(manufacturer, year)
  df2 <- attr(df1, "groups")
  Map(function(rows) {
    r <- df1[rows, "cty", drop = TRUE]
    setNames(r, df1[rows, "hwy", drop = TRUE])
  },
  df2$.rows
  ) -> l
  data.frame(manufacturer = df2$manufacturer, year = df2$year, named_vector = I(l))
}
h <- function(X){
  X %>%
    group_by(manufacturer, year) %>%
    summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)), .groups = "drop")
}
i <- function(X){
  X |>
    select(manufacturer, year, cty, hwy) |>
    group_by(manufacturer, year) |>
    group_modify(\(x, ...) tibble(res = list(deframe(x))))
}

mb <- microbenchmark(
  Emman = f(my_mpg),
  Nicolas2 = g(my_mpg),
  Rui = h(my_mpg),
  Till = i(my_mpg)
)
print(mb, unit = "relative", order = "median")
#Unit: relative
# expr min lq mean median uq max neval cld
# Rui 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# Nicolas2 1.527957 1.468524 1.478286 1.482185 1.471565 1.724004 100 b
# Emman 4.504185 4.230921 4.215643 4.234087 4.148188 4.170934 100 c
# Till 6.264028 5.813678 5.883107 5.810876 5.744080 5.666524 100 d
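You can use dplyr::group_modify() and tibble::deframe() here. Instead of deframe(), the pull(x, cty, hwy) from your question would also do the job. (Note that deframe() takes the names from the first column and the values from the second, so the two calls differ in which column ends up as the names.)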
library(tidyverse)

mpg |>
  select(manufacturer, year, cty, hwy) |>
  group_by(manufacturer, year) |>
  group_modify(\(x, ...) tibble(res = list(deframe(x))))
#> # A tibble: 30 × 3
#> # Groups: manufacturer, year [30]
#> manufacturer year res
#> <chr> <int> <list>
#> 1 audi 1999 <int [9]>
#> 2 audi 2008 <int [9]>
#> 3 chevrolet 1999 <int [7]>
#> 4 chevrolet 2008 <int [12]>
#> 5 dodge 1999 <int [16]>
#> 6 dodge 2008 <int [21]>
#> 7 ford 1999 <int [15]>
#> 8 ford 2008 <int [10]>
#> 9 honda 1999 <int [5]>
#> 10 honda 2008 <int [4]>
#> # … with 20 more rows
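One detail worth checking before swapping one call for the other: per their documentation, tibble::deframe() uses the first column as names and the second as values, whereas pull(x, cty, hwy) takes the values from cty and the names from hwy. The sketch below only mimics the two orientations with base R's setNames() on a made-up two-column data frame (the values in `x` are assumptions); it does not call either package.

```r
# Hypothetical group data with the same column order as in the answer above
x <- data.frame(cty = c(18L, 21L), hwy = c(29L, 30L))

# Orientation of deframe(x): names from column 1, values from column 2
like_deframe <- setNames(x[[2]], x[[1]])

# Orientation of pull(x, cty, hwy): values from cty, names from hwy
like_pull <- setNames(x$cty, x$hwy)

like_deframe
like_pull
```

Both results print as `<int [2]>` inside a list-column, so the difference is easy to miss. Reordering the columns before deframe() (hwy first, then cty) should give the orientation used in the question.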
So far, @RuiBarradas's solution is the fastest. I came up with a data.table version of it, which seems to improve the speed a little.
library(ggplot2)
library(purrr)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

my_mpg <-
  mpg %>%
  select(manufacturer, year, hwy, cty)

my_mpg %>%
  as.data.table() %>%
  .[, .(named_vec = map2(.x = list(cty), .y = list(hwy), .f = ~setNames(.x, .y))), .(manufacturer, year)] %>%
  as_tibble()
#> # A tibble: 30 x 3
#> manufacturer year named_vec
#> <chr> <int> <list>
#> 1 audi 1999 <int [9]>
#> 2 audi 2008 <int [9]>
#> 3 chevrolet 2008 <int [12]>
#> 4 chevrolet 1999 <int [7]>
#> 5 dodge 1999 <int [16]>
#> 6 dodge 2008 <int [21]>
#> 7 ford 1999 <int [15]>
#> 8 ford 2008 <int [10]>
#> 9 honda 1999 <int [5]>
#> 10 honda 2008 <int [4]>
#> # ... with 20 more rows
Created on 2021-09-28 by the reprex package (v0.3.0)
Benchmarking
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(tibble)
library(purrr)
library(data.table, warn.conflicts = FALSE)

my_mpg <-
  mpg %>%
  select(manufacturer, year, cty, hwy)

f <- function(X) {
  X %>%
    group_by(manufacturer, year) %>%
    nest() %>%
    mutate(desired_named_vec = map(.x = data, .f = ~pull(.x, cty, hwy))) %>%
    select(-data)
}
g <- function(X) {
  # use the argument X rather than the global my_mpg
  df1 <- X %>% group_by(manufacturer, year)
  df2 <- attr(df1, "groups")
  Map(function(rows) {
    r <- df1[rows, "cty", drop = TRUE]
    setNames(r, df1[rows, "hwy", drop = TRUE])
  },
  df2$.rows
  ) -> l
  data.frame(manufacturer = df2$manufacturer, year = df2$year, named_vector = I(l))
}
h <- function(X){
  X %>%
    group_by(manufacturer, year) %>%
    summarise(named_vec = map2(list(cty), list(hwy), ~set_names(.x, .y)), .groups = "drop")
}
i <- function(X){
  X |>
    select(manufacturer, year, cty, hwy) |>
    group_by(manufacturer, year) |>
    group_modify(\(x, ...) tibble(res = list(deframe(x))))
}
j <- function(X){
  X %>%
    as.data.table() %>%
    .[, .(named_vec = map2(.x = list(cty), .y = list(hwy), .f = ~setNames(.x, .y))), .(manufacturer, year)] %>%
    as_tibble()
}

library(microbenchmark)
library(bench)

mb <- microbenchmark(
  Emman_OP = f(my_mpg),
  Nicolas2 = g(my_mpg),
  Rui = h(my_mpg),
  Till = i(mpg),
  Emman_data.table_version_of_Rui = j(my_mpg)
)
print(mb, unit = "relative", order = "median")
#> Unit: relative
#> expr min lq mean median
#> Emman_data.table_version_of_Rui 1.000000 1.000000 1.000000 1.000000
#> Rui 2.472627 2.457073 2.392998 2.409865
#> Nicolas2 3.317832 3.317378 3.116434 3.295358
#> Emman_OP 10.255926 10.472251 9.842886 10.674290
#> Till 14.061003 14.333661 13.115049 14.937978
#> uq max neval
#> 1.000000 1.000000 100
#> 2.395210 2.191381 100
#> 3.258533 2.719938 100
#> 10.572811 5.331644 100
#> 14.086673 5.418907 100
Nice, the data.table adaptation of @Rui's solution is the fastest.
But!
If we look at memory allocation (which is another aspect of performance):
bm <- bench::mark(
  Emman_OP = f(my_mpg),
  Nicolas2 = g(my_mpg),
  Rui = h(my_mpg),
  Till = i(mpg),
  Emman_data.table_version_of_Rui = j(my_mpg),
  check = FALSE
)
summary(bm, relative = TRUE)
#> # A tibble: 5 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Emman_OP 12.0 11.8 1.24 13.8 1.18
#> 2 Nicolas2 3.75 3.67 3.84 1 1.08
#> 3 Rui 2.77 2.76 5.18 1.49 1.06
#> 4 Till 15.1 15.7 1 7.10 1.45
#> 5 Emman_data.table_version_of_Rui 1 1 14.3 7.53 1
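We can see that @Nicolas2's is the lightest, @Rui's original is also good, and my data.table version not so much. I wonder why, and whether there is a way to make the data.table version more efficient in terms of memory allocation as well.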
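One thing that might be worth trying (not benchmarked here, so treat it as a sketch rather than a fix): since each group yields exactly one vector, the map2() call can be dropped and the named vector wrapped in list() directly inside j. The toy table below is made up for illustration; whether this actually lowers mem_alloc on the real data would need to be measured with bench::mark().

```r
library(data.table)

# Toy data (hypothetical values, not the mpg dataset)
dt <- data.table(
  manufacturer = c("audi", "audi", "audi", "ford", "ford"),
  year         = c(1999L, 1999L, 2008L, 1999L, 1999L),
  cty          = c(18L, 21L, 20L, 11L, 14L),
  hwy          = c(29L, 29L, 31L, 17L, 17L)
)

# list(...) makes named_vec a list-column holding one named vector per group
res <- dt[, .(named_vec = list(setNames(cty, hwy))), by = .(manufacturer, year)]

res
res$named_vec[[1]]
```

This also skips the purrr dependency and the as_tibble() conversion; add the conversion back if a tibble is needed downstream.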