Separate column on a space after keyword

Question

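I have a data frame column of strings that can contain multiple spaces. I want to use separate from tidyr (or something similar) to split on the space after the first occurrence of a keyword (i.e., fruit_key in the sample data), so that one column is split into two.

Sample data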

df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon", 
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler", 
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA, 
-7L))

fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

Expected output

                         fruit   Delicious                Tasty
1       Apple Orange Pineapple       Apple     Orange Pineapple
2         Plum Good Watermelon   Plum Good           Watermelon
3               Plum Good Kiwi   Plum Good                 Kiwi
4          Plum Good Plum Good   Plum Good            Plum Good
5             Cantaloupe Melon  Cantaloupe                Melon
6 Blueberry Blackberry Cobbler   Blueberry   Blackberry Cobbler
7          Peach Pie Apple Pie   Peach Pie            Apple Pie

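With separate I can get the part after the keyword into the correct column (i.e., Tasty), but I cannot get the keyword itself returned in the other column (i.e., Delicious). I tried changing the regex several times but never got the right output.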

library(tidyr)

separate(df, fruit,
 c("Delicious", "Tasty"),
 sep = paste(fruit_key, collapse = "|"),
 extra = "merge",
 remove = FALSE
)

#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
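
The empty Delicious column is consistent with how a separator works: whatever sep matches is treated as the delimiter and discarded. As a minimal base R sketch of the same effect (strsplit is used here only for illustration and is not part of the attempt above; the names pattern and pieces are made up for this example):

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Same alternation pattern that was passed to sep above
pattern <- paste(fruit_key, collapse = "|")

# The keyword is the delimiter, so it is consumed by the split:
# the first piece is empty and the second keeps its leading space
pieces <- strsplit("Cantaloupe Melon", pattern)[[1]]
print(pieces)
```

So to keep the keyword, the pattern has to match only the space that follows it, not the keyword itself.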

I know I could use str_extract and str_remove (as shown below), but I would like to do this in one function/step with something like separate.

library(tidyverse)

df %>%
  mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
         Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))
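
One detail of this two-step approach, sketched on a single string: str_remove deletes only the first match and leaves the space that followed the keyword, so a trim is probably wanted as well. The str_trim call is an addition for illustration, not part of the code above.

```r
library(stringr)

fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
pattern <- paste(fruit_key, collapse = "|")

x <- "Plum Good Plum Good"

# str_extract returns the first match only
delicious <- str_extract(x, pattern)

# str_remove drops the first match only; the separating space remains
tasty_raw <- str_remove(x, pattern)

# Trimming removes the leftover leading space
tasty <- str_trim(tasty_raw)
```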

Answer 1

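If we need to use separate with sep, build a regex lookbehind, "(?<=<fruit_key>) ", so that the split happens at the space immediately following a fruit_key word. Because sep is not vectorized, collapse the per-keyword patterns into a single string joined with | (using str_c).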

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
   separate(fruit, into = c("Delicious", "Tasty"), 
     sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"), 
         extra = "merge", remove = FALSE)

-output

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
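
To make the generated separator easier to see, here is a small base R sketch that builds the same pattern with paste instead of str_c and applies it to one string. The names sep_pattern and pieces are made up for this example, and strsplit with perl = TRUE stands in for separate only for illustration.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per keyword, joined with | into a single pattern
sep_pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")
print(sep_pattern)

# Only the space is matched; the keyword before it is asserted, not consumed
pieces <- strsplit("Plum Good Kiwi", sep_pattern, perl = TRUE)[[1]]
print(pieces)
```

In the separate call above, extra = "merge" is what keeps everything after the first split together in Tasty, which matters for rows such as "Peach Pie Apple Pie" where a keyword appears again later in the string.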

Answer 2

Here's a concise solution using the tidyr function extract:

library(tidyr)
df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)

                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie

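In the regex argument of extract, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is recognized as a capture group. The second capture group is simply whatever follows the whitespace.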
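
As a rough illustration of the two capture groups, the same regex can be assembled and tried on a single string in base R. Here regexec and regmatches are stand-ins chosen for this sketch (they are not what the answer uses), and the names regex and m are made up.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Group 1: any keyword; then one whitespace character; group 2: the rest
regex <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")
print(regex)

x <- "Peach Pie Apple Pie"

# m[1] is the full match, m[2] and m[3] are the two capture groups
m <- regmatches(x, regexec(regex, x))[[1]]
print(m)
```

Note the doubled backslash: inside an R string literal, "\\s" is what produces the regex token \s.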
