Separate column on a space after keyword
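I have a data frame column containing strings that may include several spaces. I want to use separate from tidyr (or something similar) to split on the space that follows the first occurrence of a keyword (i.e., fruit_key in the sample data), so that one column is split into two.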
Sample data
df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon",
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler",
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA,
-7L))
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
Expected output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
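With separate, I can get the part after the keyword into the correct column (i.e., Tasty), but I cannot get the keyword itself returned in the other column (i.e., Delicious). I have tried changing the regex several times, but I never get the correct output.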
library(tidyr)
separate(df, fruit,
c("Delicious", "Tasty"),
sep = paste(fruit_key, collapse = "|"),
extra = "merge",
remove = FALSE
)
#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
I know that I could use str_extract and str_remove (as shown below), but I would like to use something like separate to do it in one function/step.
library(tidyverse)
df %>%
mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))
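If we need to use separate with sep, create a regex lookbehind, "(?<=<fruit_key>) ", i.e. split at the space that directly follows the fruit_key word. Because sep is not vectorized, collapse the per-keyword patterns into a single string joined with | (here with str_c).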
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate(fruit, into = c("Delicious", "Tasty"),
sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"),
extra = "merge", remove = FALSE)
-output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
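To see why the lookbehind keeps the keyword, here is a minimal base R sketch of the same pattern. It is not what separate does internally; regexpr with perl = TRUE and substr are stand-ins chosen so that the example runs without extra packages, and the three-element fruit vector is a shortened copy of the sample data.

```r
fruit <- c("Apple Orange Pineapple", "Plum Good Watermelon", "Peach Pie Apple Pie")
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per keyword, joined with | into a single pattern.
# Only the space is matched; the keyword is asserted but not consumed.
pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")

# Position of the first space that directly follows a keyword
pos <- regexpr(pattern, fruit, perl = TRUE)

# Everything before that space is the keyword, everything after is the rest
Delicious <- substr(fruit, 1, pos - 1)
Tasty <- substr(fruit, pos + 1, nchar(fruit))

print(data.frame(fruit, Delicious, Tasty))
```

Because the matched text is just the single space, nothing from the keyword is discarded at the split point. Rows where no keyword is found are not handled in this sketch (regexpr returns -1 for them).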
Here is a concise solution using the tidyr function extract:
library(tidyr)
df %>%
extract(fruit,
into = c("Delicious", "Tasty"),
regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
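In the regex argument of extract, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is recognized as a capture group. The second capture group is simply whatever follows the whitespace.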
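The two capture groups can also be inspected without tidyr. The following is a rough base R sketch, assuming regexec and regmatches as stand-ins for extract; the ^ and $ anchors and the shortened fruit vector are additions for illustration and are not part of the answer above.

```r
fruit <- c("Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler")
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Group 1: any one of the keywords; group 2: everything after one whitespace.
# Note the doubled backslash: "\s" alone is not a valid R string escape.
regex <- paste0("^(", paste(fruit_key, collapse = "|"), ")\\s(.*)$")

# regmatches() returns, per string, the full match followed by each group
m <- regmatches(fruit, regexec(regex, fruit, perl = TRUE))
parts <- do.call(rbind, m)

result <- data.frame(fruit = fruit,
                     Delicious = parts[, 2],  # first capture group
                     Tasty = parts[, 3],      # second capture group
                     stringsAsFactors = FALSE)
print(result)
```

For "Plum Good Plum Good" the first group takes the leading keyword and the second group takes the remaining "Plum Good", so a repeated keyword is not split twice. Strings that match no keyword would yield an empty match and drop out of the rbind step, so this sketch assumes every row starts with a keyword.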