Extracting string between words using logical operators in rm_between function

Question

I am trying to extract the string between two words. Consider this example:

x <-  "There are 2.3 million species in the world"

It can also take another form, namely

x <-  "There are 2.3 billion species in the world"

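I need the text between There and either million or billion, including both markers. Whether million or billion is present is decided at run time, not beforehand. So the output I need for these sentences is

[1] There are 2.3 million
or
[2] There are 2.3 billion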

I am using the rm_between function from the qdapRegex package. With this command I can extract only one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)

or I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a single command that checks whether either million or billion is present in the same sentence? Something like this:

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.

Answer 1

You can use str_extract_all (global match) or str_extract (single match):

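library(stringr)
# s is the input character vector (x in the question).
# Backslashes must be doubled inside an R string literal.
str_extract_all(s, "\\bThere\\b.*?\\b(?:million|billion)\\b")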

or

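# perl() comes from older stringr releases; newer releases may accept the pattern string directly.
str_extract_all(s, perl("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))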
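
For readers without stringr installed, the same idea can be sketched in base R. This is only an illustration, not part of the original answer: it reuses the first pattern above with regexpr() and regmatches(), and it assumes the two-element vector x from the question as input.

```r
# Base R sketch (no stringr needed); the pattern mirrors the first one above.
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from the word "There" up to the first "million" or "billion".
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each element; regmatches() extracts it.
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(first_match)
#> [1] "There are 2.3 million" "There are 2.3 billion"
```

Note that regmatches() combined with regexpr() silently drops elements that do not match, so the result can be shorter than the input.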

Answer 2

The left and right arguments of rm_between take a vector of character/numeric symbols. So you can pass vectors of equal length to both the left and right arguments.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

Or

  # The backslash must be doubled inside an R string literal
  sub('\\s*species.*', '', x)

Data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"

Answer 3

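~~With rm_between you can supply vectors for multiple markers of equal length, as stated in the documentation.~~

Edit

See @TylerRinker's answer for the updated argument to rm_between.

Nevertheless, another approach that lets you use a user-defined regular expression is rm_default: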

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example:

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"
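
Because the question says the right-hand word is only known at run time, the regular expression itself can be assembled from a vector of candidate words. The following is a minimal sketch under stated assumptions: build_between_pattern is a hypothetical helper (not part of qdapRegex), the markers are assumed to contain no regex metacharacters, and base R's regmatches() is used so the example runs without extra packages. The resulting string could presumably be passed as the pattern argument of rm_default in the same way as the hand-written pattern above.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Hypothetical helper (not part of qdapRegex): builds "left.*?(?:a|b|...)".
# Assumes the markers contain no regex metacharacters.
build_between_pattern <- function(left, rights) {
  paste0(left, ".*?(?:", paste(rights, collapse = "|"), ")")
}

units <- c("million", "billion")  # could be supplied at run time
pattern <- build_between_pattern("There", units)
print(pattern)
#> [1] "There.*?(?:million|billion)"

# Extract the first match per element with base R.
extracted <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(extracted)
#> [1] "There are 2.3 million" "There are 2.3 billion"
```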

Answer 4

The response from @hwnd (my qdapRegex co-author) sparked a discussion that led to a new argument, fixed, for rm_between. Here is the description from the development version:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed by default (escaped). This did not allow for the powerful use of a regular expression for left/right boundaries. The fixed = TRUE behavior is still the default but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question:

To install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

With qdapRegex version >= 4.1 you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

# fixed = FALSE treats the boundaries as regular expressions
rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
    include.markers = TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
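
The difference between a fixed (literal) boundary and a regular-expression boundary can be illustrated with base R alone. This sketch is only an analogy using grepl(), whose own fixed argument switches between literal and regex matching; it does not show how qdapRegex implements the feature internally.

```r
x <- "There are 2.3 million species in the world"

# Literal interpretation: the characters "[mb]illion" are searched for as-is.
literal_hit <- grepl("[mb]illion", x, fixed = TRUE)

# Regex interpretation: "[mb]" is a character class matching "m" or "b".
regex_hit <- grepl("[mb]illion", x, fixed = FALSE)

print(c(literal = literal_hit, regex = regex_hit))
#> literal   regex 
#>   FALSE    TRUE 
```

With a literal boundary the sentence never matches, because it does not contain the bracket characters; with a regex boundary the single pattern covers both million and billion.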
