How can I use str_remove() within a list of dataframes using a map function?

Question

I have a list of data frames that all contain a matching id column.

For example:

library(tibble)

dat1 <- tribble(
  ~id, ~response,
  "id_1", 10,
  "id_2", 15
)

dat2 <- tribble(
  ~id, ~response,
  "id_3", 20,
  "id_4", 25
)

example_list <- list(dat1, dat2)

> list(dat1, dat2)

[[1]]
# A tibble: 2 × 2
  id    response
  <chr>    <dbl>
1 id_1        10
2 id_2        15

[[2]]
# A tibble: 2 × 2
  id    response
  <chr>    <dbl>
1 id_3        20
2 id_4        25

How can I map str_remove() across the data frames to remove the "id_" prefix from every row of the id column?

Answer 1

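Use purrr::map together with str_remove (or gsub, or readr::parse_number). Note that parse_number returns a numeric column, while the other two keep id as character.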

library(tidyverse)

example_list %>% 
  map(~ mutate(.x, id = str_remove(id, "id_")))

# Alternatives that can replace the map() call above:
#   map(~ .x %>% mutate(id = gsub("id_", "", id)))
#   map(~ mutate(.x, id = parse_number(id)))  # returns a numeric id

Output

[[1]]
# A tibble: 2 × 2
  id    response
  <chr>    <dbl>
1 1           10
2 2           15

[[2]]
# A tibble: 2 × 2
  id    response
  <chr>    <dbl>
1 3           20
2 4           25
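As a small sketch of how the alternatives differ, the example below rebuilds the question's data and compares the resulting column types. The anchored pattern "^id_" is an assumption added here (the answer uses the unanchored "id_"); it restricts the removal to the start of the string.

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)
library(readr)

# Same data as in the question
example_list <- list(
  tribble(~id, ~response, "id_1", 10, "id_2", 15),
  tribble(~id, ~response, "id_3", 20, "id_4", 25)
)

# "^" anchors the match, so "id_" is removed only at the start of the string
as_chr <- map(example_list, ~ mutate(.x, id = str_remove(id, "^id_")))

# parse_number() extracts the number and returns a double instead of a string
as_num <- map(example_list, ~ mutate(.x, id = parse_number(id)))

map_chr(as_chr, ~ class(.x$id))  # "character" "character"
map_chr(as_num, ~ class(.x$id))  # "numeric" "numeric"
```

If the ids are later used for joins or sorting, the numeric version may be the more convenient one.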

Answer 2

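You can nest modify_at() inside map() for better speed. In addition, substring should be faster than pattern matching, because you already know the length of the prefix.

Of course, you may want as.integer() to convert the result back to a number, but that is incidental to the solution.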

library(purrr)

example_list %>% 
  map(modify_at, "id", substring, 4)

# [[1]]
# # A tibble: 2 x 2
#   id    response
#   <chr>    <dbl>
# 1 1           10
# 2 2           15
# 
# [[2]]
# # A tibble: 2 x 2
#   id    response
#   <chr>    <dbl>
# 1 3           20
# 2 4           25

# to convert to integer
example_list %>% 
    map(modify_at, "id", ~ as.integer(substring(.x, 4)))
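The positional approach only works if every value really starts with the prefix. A minimal base R sketch of that assumption follows; the `prefix` and `ids` variables are illustrative and not part of the original answer.

```r
prefix <- "id_"
ids <- c("id_1", "id_2", "id_10")

# Guard: positional removal is only valid if every value starts with the prefix
stopifnot(all(startsWith(ids, prefix)))

# Start one character past the prefix instead of hard-coding 4
stripped <- substring(ids, nchar(prefix) + 1)
stripped              # "1" "2" "10"
as.integer(stripped)  # 1 2 10
```

Deriving the start position from nchar(prefix) keeps the code correct if the prefix changes later.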

Benchmarking a few of the options:

library(purrr)
library(dplyr)
library(stringr)

microbenchmark::microbenchmark(
  modify_substring = example_list %>% 
    map(modify_at, "id", substring, 4),
  
  mutate_substring = example_list %>% 
    map(~ mutate(.x, id = substring(id, 4))),
  
  mutate_str_remove = example_list %>% 
    map(~ mutate(.x, id = str_remove(id, "id_")))
)

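You can see that the modify_at() approach runs faster: on this small example its median time is roughly 8 times lower than mutate with substring and roughly 12 times lower than mutate with str_remove.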

Unit: microseconds
              expr      min        lq     mean    median       uq        max neval
  modify_substring  302.301  359.9005  442.340  419.6505  459.901   1597.401   100
  mutate_substring 3019.502 3308.6015 4916.405 3540.5505 3847.801 116220.501   100
 mutate_str_remove 4064.801 4568.4010 5355.351 4839.1010 5232.452  10521.701   100
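Before comparing timings, it can be worth confirming that the three variants produce the same id values. The sketch below does this on the question's data; the helper `ids()` and the `res_*` names are hypothetical and introduced only for this check.

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)

# Same data as in the question
example_list <- list(
  tribble(~id, ~response, "id_1", 10, "id_2", 15),
  tribble(~id, ~response, "id_3", 20, "id_4", 25)
)

res_modify     <- map(example_list, modify_at, "id", substring, 4)
res_substring  <- map(example_list, ~ mutate(.x, id = substring(id, 4)))
res_str_remove <- map(example_list, ~ mutate(.x, id = str_remove(id, "id_")))

# Hypothetical helper: pull the id column out of each data frame
ids <- function(res) map(res, "id")

all_same <- identical(ids(res_modify), ids(res_substring)) &&
  identical(ids(res_substring), ids(res_str_remove))
all_same
```

With only two rows per data frame, the timings are likely dominated by per-call overhead rather than by the string operation itself, so the gap may look different on larger data.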
