R: Extracting Bigrams with Zero-Width Lookaheads
I would like to extract bigrams from sentences using the regex described here, and store the output in a new column that references the original.
library(dplyr)
library(stringr)
library(splitstackshape)
df <- data.frame(a =c("apple orange plum"))
# Single Words - Successful
df %>%
  # Base R
  mutate(b = sapply(regmatches(a, gregexpr("\\w+\\b", a, perl = TRUE)),
                    paste, collapse = ";")) %>%
  # Duplicate with Stringr
  mutate(c = sapply(str_extract_all(a, "\\w+\\b"), paste, collapse = ";")) %>%
  cSplit(., c(2, 3), sep = ";", direction = "long")
Initially, I thought the problem was with the regex engine, but neither stringr::str_extract_all (ICU) nor base::regmatches (PCRE) works.
# Bigrams - Fails
df %>%
  # Base R
  mutate(b = sapply(regmatches(a, gregexpr("(?=(\\b\\w+\\s+\\w+))", a, perl = TRUE)),
                    paste, collapse = ";")) %>%
  # Duplicate with Stringr
  mutate(c = sapply(str_extract_all(a, "(?=(\\b\\w+\\s+\\w+))"), paste, collapse = ";")) %>%
  cSplit(., c(2, 3), sep = ";", direction = "long")
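As a result, I suspect the problem has to do with using a zero-width lookahead around a capturing group. Is there any valid regex in R that will extract these bigrams?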
As @WiktorStribiżew suggested, using str_match_all helps here. Here is how to apply it to multiple rows of a data frame. Let
(df <- data.frame(a = c("one two three", "four five six")))
# a
# 1 one two three
# 2 four five six
Then we can do
df %>% rowwise() %>%
  do(data.frame(., b = str_match_all(.$a, "(?=(\\b\\w+\\s+\\w+))")[[1]][, 2], stringsAsFactors = FALSE))
# Source: local data frame [4 x 2]
# Groups: <by row>
#
# A tibble: 4 x 2
# a b
# * <fct> <chr>
# 1 one two three one two
# 2 one two three two three
# 3 four five six four five
# 4 four five six five six
where stringsAsFactors = FALSE is only there to avoid warnings from binding rows.
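To see why the extraction functions come back empty while str_match_all works, it may help to compare them on a single string. The sketch below assumes the same sample sentence as in the question; the variable names (x, pattern, whole, bigrams, bigrams_base) are only illustrative. A lookahead consumes no characters, so the overall match is an empty string and the bigram lives only in capture group 1.

```r
library(stringr)

x <- "apple orange plum"
pattern <- "(?=(\\b\\w+\\s+\\w+))"

# str_extract_all() returns the text of the overall match. The lookahead is
# zero-width, so each element is an empty string.
whole <- str_extract_all(x, pattern)[[1]]

# str_match_all() returns a matrix per input string: column 1 is the
# (empty) overall match, column 2 is capture group 1, i.e. the bigram.
bigrams <- str_match_all(x, pattern)[[1]][, 2]
bigrams
# [1] "apple orange" "orange plum"

# Base R alternative: with perl = TRUE, gregexpr() records where each
# capture group starts and how long it is, so the group can be cut out
# with substring() instead of regmatches().
m <- gregexpr(pattern, x, perl = TRUE)[[1]]
starts <- attr(m, "capture.start")[, 1]
lens <- attr(m, "capture.length")[, 1]
bigrams_base <- substring(x, starts, starts + lens - 1)
```

Both approaches read the capture group rather than the match itself, which is what lets the bigrams overlap.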