Combine {stringr} and mutate() to manipulate multiple columns at once in a large dataset?

Question

Reprex

df <- tibble(name = c("Person_1","Person_2","Person_3"),
             `AxxBxx1:0` = c("1:04","2:02","0:1"),
             `AxxCxx5:0` = c("5:04","3:02","0:0"),
             `BxxCxx2:1` = c("2:14","1:03","0:1"))

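The goal is to turn this data frame into another one, in which the variables ending in _real are taken from the column names, while _bet and _result come from the first and second parts of the values in df: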

df_2 <- tibble(name = c("Person_1","Person_2","Person_3"),
               AxxBxx_real = "1:0",
               AxxCxx_real = "5:0",
               BxxCxx_real = "2:1",
               AxxBxx_bet = c("1:0","2:0","0:1"),
               AxxCxx_bet = c("5:0","3:0","0:0"),
               BxxCxx_bet = c("2:1","1:0","0:1"),
               AxxBxx_result = c("4","2",""),
               AxxCxx_result = c("4","2",""),
               BxxCxx_result = c("4","3",""))

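Problem: the actual dataset is much larger than df, so ideally I would like to automate as much of the transformation from df to df_2 as possible.

Code (i.e. what I have done so far)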

library(tidyverse)

# Step 1: Get real match results from variable names.
df$AxxBxx_real <- "1:0"
df$AxxCxx_real <- "5:0"
df$BxxCxx_real <- "2:1"

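Is there a way to mutate() the original variables in df into these three _real variables in one go, without having to look up each match result by hand? mutate(names(df)[2:4] = str_extract(...)) or something similar obviously does not work.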

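# Step 2: Create `_bet` and `_result` variables.
# The pattern was originally written as "[0-99]:[0-99]"; that character class
# is equivalent to "[0-9]", i.e. exactly one digit on each side of the colon.
str_remove(names(df)[2:4], "[0-9]:[0-9]") %>%
  paste0("_bet") -> names(df)[2:4]

# Copy the raw values so they can be split into two parts below.
df %>%
  mutate(AxxBxx_result = AxxBxx_bet,
         AxxCxx_result = AxxCxx_bet,
         BxxCxx_result = BxxCxx_bet) -> df

# Keep only the leading "x:y" part as the bet.
df$AxxBxx_bet <- str_extract(df$AxxBxx_bet, "[0-9]:[0-9]")
df$AxxCxx_bet <- str_extract(df$AxxCxx_bet, "[0-9]:[0-9]")
df$BxxCxx_bet <- str_extract(df$BxxCxx_bet, "[0-9]:[0-9]")

# Drop the leading "x:y" part; what remains is the result.
df$AxxBxx_result <- str_remove(df$AxxBxx_result, "[0-9]:[0-9]")
df$AxxCxx_result <- str_remove(df$AxxCxx_result, "[0-9]:[0-9]")
df$BxxCxx_result <- str_remove(df$BxxCxx_result, "[0-9]:[0-9]")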

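The problem here is that, although splitting each person's betting value follows a standard pattern, the way I create and store the new variables does not. Rather than doing this for every variable separately, I would like it to happen automatically: take the name of the original variable, remove the result from the name, and then split the values into _bet and _result. Again, the problem is that I can only mutate() one given variable at a time. Is there a better, less time-consuming way?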

Answer 1

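Here is a way to do this using the tidyr library:

This gets the data into long format, splitting each column name into two parts; using extract we then split the values into two columns, and finally we get the data back into wide format.

I suggest running it step by step to understand what is happening here.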

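library(tidyr)

df %>%
  # Long format: split each column name into the match id and the real score.
  pivot_longer(cols = -name,
               names_to = c('col1', 'real'),
               names_pattern = '([A-Za-z]+)(\\d+:\\d+)') %>%
  # Split each value into the bet ("x:y") and the optional trailing result digit.
  extract(value, c('bet', 'result'), '(\\d+:.)(.)?') %>%
  # Back to wide format, naming the columns <match id>_<real|bet|result>.
  pivot_wider(names_from = col1, values_from = c(real, bet, result),
              names_glue = '{col1}_{.value}')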

#  name     AxxBxx_real AxxCxx_real BxxCxx_real AxxBxx_bet AxxCxx_bet BxxCxx_bet AxxBxx_result AxxCxx_result BxxCxx_result
#  <chr>    <chr>       <chr>       <chr>       <chr>      <chr>      <chr>      <chr>         <chr>         <chr>        
#1 Person_1 1:0         5:0         2:1         1:0        5:0        2:1        "4"           "4"           "4"          
#2 Person_2 1:0         5:0         2:1         2:0        3:0        1:0        "2"           "2"           "3"          
#3 Person_3 1:0         5:0         2:1         0:1        0:0        0:1        ""            ""            ""
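
To make the "step by step" suggestion concrete, here is a sketch of the same pipeline with each stage stored in its own object so it can be inspected. The object names `long`, `parts` and `wide` are only illustrative and are not part of the original answer; the data is the `df` from the question.

```r
library(tidyr)
library(tibble)

df <- tibble(name = c("Person_1", "Person_2", "Person_3"),
             `AxxBxx1:0` = c("1:04", "2:02", "0:1"),
             `AxxCxx5:0` = c("5:04", "3:02", "0:0"),
             `BxxCxx2:1` = c("2:14", "1:03", "0:1"))

# Step 1: one row per person and match. The column name is split into the
# match id (col1) and the real score (real); the cell goes into `value`.
long <- pivot_longer(df,
                     cols = -name,
                     names_to = c("col1", "real"),
                     names_pattern = "([A-Za-z]+)(\\d+:\\d+)")

# Step 2: split `value` into the bet ("x:y") and the optional trailing character.
parts <- extract(long, value, c("bet", "result"), "(\\d+:.)(.)?")

# Step 3: spread back to one row per person, one column per match and field.
wide <- pivot_wider(parts,
                    names_from = col1,
                    values_from = c(real, bet, result),
                    names_glue = "{col1}_{.value}")
```

Printing `long` and `parts` in turn shows how the three original columns become nine rows and how the `value` column is replaced by `bet` and `result` before the final reshape.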

I am not sure why name changes from Person_1, Person_2, Person_3 in the input to A, B and C in the output. I have kept the names the same here.
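
Since the question asks specifically about {stringr} and mutate(), here is a possible alternative sketch that stays in wide format and uses across() from dplyr. It is not part of the original answer. The helper names `match_cols`, `match_ids`, `real_vals` and `df_wide` are made up for illustration, and the patterns assume what the sample data suggests: every score column name ends in the real score, and each bet has a single-digit second number directly followed by the optional result.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tibble(name = c("Person_1", "Person_2", "Person_3"),
             `AxxBxx1:0` = c("1:04", "2:02", "0:1"),
             `AxxCxx5:0` = c("5:04", "3:02", "0:0"),
             `BxxCxx2:1` = c("2:14", "1:03", "0:1"))

# Every column except `name` is assumed to be a match column.
match_cols <- setdiff(names(df), "name")
# Match ids are the column names with the trailing real score removed.
match_ids <- str_remove(match_cols, "\\d+:\\d+$")
# Named list such as list(AxxBxx_real = "1:0", ...), built from the column names.
real_vals <- as.list(setNames(str_extract(match_cols, "\\d+:\\d+$"),
                              paste0(match_ids, "_real")))

df_wide <- df %>%
  # Strip the real score from the column names so they become plain match ids.
  rename_with(~ str_remove(.x, "\\d+:\\d+$"), all_of(match_cols)) %>%
  mutate(
    # Splice in one constant `_real` column per match.
    !!!real_vals,
    # The leading "x:y" of each value is the bet ...
    across(all_of(match_ids), ~ str_extract(.x, "^\\d+:\\d"), .names = "{.col}_bet"),
    # ... and whatever follows it is the result.
    across(all_of(match_ids), ~ str_remove(.x, "^\\d+:\\d"), .names = "{.col}_result")
  ) %>%
  # Drop the raw columns now that they have been split.
  select(-all_of(match_ids))
```

With this approach the set of match columns is derived from `names(df)`, so adding more matches to the data should not require touching the code, as long as the naming assumption above holds.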

