Separate a column on the space after a keyword

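I have a data frame column of strings that may contain multiple spaces. I want to use separate from tidyr (or something similar) to split on the space that follows the first occurrence of a keyword (i.e., one of the values in fruit_key in the sample data), so that one column becomes two.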

Sample data

df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon", 
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler", 
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA, 
-7L))

fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

Expected output

                         fruit   Delicious                Tasty
1       Apple Orange Pineapple       Apple     Orange Pineapple
2         Plum Good Watermelon   Plum Good           Watermelon
3               Plum Good Kiwi   Plum Good                 Kiwi
4          Plum Good Plum Good   Plum Good            Plum Good
5             Cantaloupe Melon  Cantaloupe                Melon
6 Blueberry Blackberry Cobbler   Blueberry   Blackberry Cobbler
7          Peach Pie Apple Pie   Peach Pie            Apple Pie

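With separate I can get the part after the keyword into the correct column (i.e., Tasty), but I cannot get the keyword itself returned in the other column (i.e., Delicious), because the text matched by sep is discarded. I tried changing the regular expression several times but never got the correct output.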

library(tidyr)

separate(df, fruit,
 c("Delicious", "Tasty"),
 sep = paste(fruit_key, collapse = "|"),
 extra = "merge",
 remove = FALSE
)

#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie

I know I can use str_extract and str_remove (as shown below), but I would like to do this in a single function/step with something like separate.

library(tidyverse)

df %>%
  mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
         Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))

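If we need to use separate with sep, create a regex lookbehind, "(?<=<fruit_key>) ", which splits at the space that follows the fruit_key word. Because sep is not vectorized, collapse the patterns into a single string with | (using str_c).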

library(dplyr)
library(tidyr)
library(stringr)

df %>%
  separate(fruit,
           into = c("Delicious", "Tasty"),
           # one lookbehind per keyword, joined with "|" into a single regex
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"),
           extra = "merge",
           remove = FALSE)

-output

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
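
To see what the lookbehind pattern above looks like and why it keeps the keyword, here is a minimal base R sketch. It assumes the same fruit_key vector as in the question; the names sep_pattern, x, pos and parts are only illustrative and are not part of the original answer.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per keyword, joined with "|" into a single regex.
# The only character actually matched (and therefore removed) is the space.
sep_pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")
print(sep_pattern)

# Base R check on one string: locate the first space preceded by a keyword
x <- "Plum Good Plum Good"
pos <- as.integer(regexpr(sep_pattern, x, perl = TRUE))

# Split at that position; the keyword stays intact on the left-hand side
parts <- c(substr(x, 1, pos - 1), substr(x, pos + 1, nchar(x)))
print(parts)
```

The space inside "Plum Good" is skipped because the text before it is only "Plum", which is not a keyword, so the first match is the space after the full keyword.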

Here is a concise solution using the tidyr function extract:

library(tidyr)

df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          # "\\s" is the R string for the regex \s; a single backslash is an invalid escape in R
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie

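In the regex argument of extract, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is recognized as a capture group. The second capture group is simply whatever follows the whitespace.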
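
The two capture groups can also be inspected without tidyr. The following is a rough base R sketch, not part of the original answer, that builds the same pattern and pulls out both groups with regexec and regmatches; the vector fruit and the names pattern, m and res are illustrative assumptions.

```r
fruit <- c("Apple Orange Pineapple", "Peach Pie Apple Pie")
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Group 1: any keyword; then one whitespace character; group 2: the rest
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")

# For each string, regmatches() returns the full match followed by the groups
m <- regmatches(fruit, regexec(pattern, fruit, perl = TRUE))

# Keep only the two capture groups and bind them into a character matrix
res <- do.call(rbind, lapply(m, function(x) x[2:3]))
colnames(res) <- c("Delicious", "Tasty")
print(res)
```

Note that this sketch, like the answer, assumes the keywords contain no regex metacharacters; keys containing characters such as "." or "(" would need escaping first.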