Turning dplyr code into a function that accepts columns as arguments

Question

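I've been struggling to understand tidyeval and the use of quo, quos, sym, !!, !!! and the like. I've made a few attempts, but I haven't been able to generalize my code so that it accepts a vector of columns and applies the text processing to those columns of a data frame. My data frame looks like this: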

ocupation      tasks               id
Sink Cleaner   Cleaning the sink   1
Lion petter    Pet the lions       2

My code looks like this:

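library(dplyr)
library(stringr)
library(glue)

# "\\b" is a regex word boundary; a single "\b" in an R string is a backspace character
stopwords_regex = paste(tm::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = glue('\\b{stopwords_regex}\\b')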


df = df %>% mutate(ocupation_proc = ocupation %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>% 
                     str_remove_all("[[:punct:]]") %>%  
                     str_squish(),
                   tasks_proc = tasks %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>%
                     str_remove_all("[[:punct:]]") %>% 
                     str_squish())

It produces something like this:

ocupation      tasks               id    ocupation_proc  tasks_proc
Sink Cleaner   Cleaning the sink   1     sink cleaner    cleaning sink
Lion petter    Pet the lions       2     lion petter     pet lions

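I'd like to turn it into a function process_text_columns(df, columns_list, new_col_names), where in this case df=df, columns_list=c('ocupation', 'tasks') and new_col_names=c('ocupation_proc', 'tasks_proc'). (new_col_names might not even be necessary if I could do something like glue('{colname}_proc') to name the new columns.) From what I've gathered, I need to use across, sym, quos, or even !!! to generalize the function, but everything I've tried has failed. Do you have any ideas?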

Thanks

Answer 1

Does this work for you? The "curly curly" operator ({{ }}), introduced in rlang 0.4 in June 2019, helps simplify "quote-and-unquote into a single interpolation step."

clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>% 
    stringi::stri_trans_general("Latin-ASCII") %>% 
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>% 
    str_squish()
}

my_great_function <- function(df, columns_list, new_col_names) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x))) %>%
    rename( !!new_col_names )
}

my_great_function(df, 
                  c(ocupation, tasks), 
                  c(ocu = "ocupation", tas = "tasks"))

Output

           ocu           tas id
1 sink cleaner cleaning sink  1
2  lion petter     pet lions  2
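
For readers wondering what "a single interpolation step" means in practice, here is a minimal sketch comparing the older enquo() plus !! pattern with {{ }}. The helpers mean_by_old and mean_by_new and the toy data frame are hypothetical names made up for this illustration; they are not part of the question or of any package.

```r
library(dplyr)

# Hypothetical helper, older style: quote the arguments with enquo(),
# then unquote them with !! inside the dplyr verbs
mean_by_old <- function(df, group_col, value_col) {
  group_col <- enquo(group_col)
  value_col <- enquo(value_col)
  df %>%
    group_by(!!group_col) %>%
    summarise(avg = mean(!!value_col))
}

# Hypothetical helper, newer style: {{ }} quotes and unquotes in one step
mean_by_new <- function(df, group_col, value_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(avg = mean({{ value_col }}))
}

# Toy data, made up for this example
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 5))

old_result <- mean_by_old(toy, g, x)
new_result <- mean_by_new(toy, g, x)
```

Both calls take bare column names and should give the same grouped means, which is why {{ }} is usually the shorter choice when a function simply forwards a column argument to a dplyr verb.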

Edit: To keep the unprocessed columns and add the processed ones under new names, the simplest approach is to use the .names argument of across:

my_great_function <- function(df, columns_list) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x), .names = "{.col}_proc"))
}

my_great_function(df, c(ocupation, tasks))


     ocupation             tasks id ocupation_proc    tasks_proc
1 Sink Cleaner Cleaning the sink  1   sink cleaner cleaning sink
2  Lion petter     Pet the lions  2    lion petter     pet lions
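
The question passes the columns as a character vector, c('ocupation', 'tasks'), rather than as bare names. A possible variant for that case, sketched under a few assumptions: it uses tidyselect's all_of() instead of {{ }}, the suffix argument is a hypothetical addition, and a tiny stand-in stopword list replaces tm::stopwords('en') so that the example runs without the tm package.

```r
library(dplyr)
library(stringr)

# Assumption: a small stand-in stopword list so the sketch is self-contained;
# swap in tm::stopwords("en") to match the question
stopwords_regex <- paste0("\\b(", paste(c("the", "a", "of"), collapse = "|"), ")\\b")

clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>%
    stringi::stri_trans_general("Latin-ASCII") %>%
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>%
    str_squish()
}

# Hypothetical variant: columns_list is a character vector of column names,
# and suffix (an assumed extra argument) is appended to each new column name
process_text_columns <- function(df, columns_list, suffix = "_proc") {
  mutate(df, across(all_of(columns_list), clean_steps,
                    .names = paste0("{.col}", suffix)))
}

df <- data.frame(
  ocupation = c("Sink Cleaner", "Lion petter"),
  tasks = c("Cleaning the sink", "Pet the lions"),
  id = 1:2
)

result <- process_text_columns(df, c("ocupation", "tasks"))
```

With all_of(), a name in columns_list that does not exist in df raises an error instead of being silently ignored, which is usually what you want when the column names come from a variable.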
