Find first number anywhere after a specific pattern in R

Question

I am trying to identify the number that appears after a certain substring in R.

For example:

sa <- "100 dollars 200"

In the string above, to find the number that appears after the word dollars, I run the following code:

library(stringr)
str_match_all(sa, "(?<=dollars )\\d+")

I get the following result:

  [[1]]
     [,1] 
[1,] "200"

However, when I use the following input:

sa <- "100 dollars for 200 pesos"

I cannot get 200 as the output.

Answer 1

You can capture the digits that follow zero or more non-digit characters. The str_match function differs from str_extract in that it keeps all capturing group values.

> sa<-"100 dollars for 200 pesos"
> str_match_all(sa,"dollars\\D*(\\d+)")
[[1]]
     [,1]              [,2] 
[1,] "dollars for 200" "200"

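Just use the value in the second column.

Pattern details

dollars - matches the substring dollars
\D* - zero or more characters other than digits (this also matches whitespace)
(\d+) - Group 1: one or more digits.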
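For readers without stringr, the same capture-group idea can be sketched in base R. This is only an illustration, not part of the original answer: regexec() reports the full match plus each capture group, and regmatches() turns that into a list with one character vector per input string. The variable names m and amounts are made up for the example.

```r
# Two inputs: with and without extra words between "dollars" and the number
sa <- c("100 dollars for 200 pesos", "100 dollars 200 pesos")

# regexec() records the whole match and every capture group;
# regmatches() extracts them as one character vector per input string
m <- regmatches(sa, regexec("dollars\\D*(\\d+)", sa))

# Element 1 of each vector is the full match, element 2 is Group 1.
# For a string with no match the vector is empty, so g[2] yields NA.
amounts <- vapply(m, function(g) g[2], character(1))
print(amounts)
```

Picking element 2 here plays the same role as reading the second column of the str_match_all() result.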

To extract only the 200 value, you can use regmatches/regexpr:

sa<-c("100 dollars for 200 pesos", "100 dollars 200 pesos")
regmatches(sa, regexpr("dollars\\D*\\K\\d+", sa, perl=TRUE))
## => [1] "200" "200"

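See the R demo.

Details

dollars - the substring
\D* - zero or more characters other than digits
\K - the match reset operator: it discards the text matched so far from the reported match
\d+ - one or more digits.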
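One thing to watch with regmatches/regexpr is that elements without a match are dropped, so the result can be shorter than the input vector. A minimal sketch of one way to keep the positions aligned follows. The helper name first_number_after and its keyword argument are hypothetical, and the keyword is assumed to contain no regex metacharacters.

```r
# Hypothetical helper: first run of digits after `keyword`, NA where none exists.
# Assumes `keyword` contains no regex metacharacters.
first_number_after <- function(x, keyword = "dollars") {
  pattern <- paste0(keyword, "\\D*\\K\\d+")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 marks elements with no match
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matched elements only
  out
}

res <- first_number_after(c("100 dollars for 200 pesos",
                            "no amount here",
                            "100 dollars 200 pesos"))
print(res)
```

The result has the same length as the input, which makes it safer to assign into a data frame column.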

The same pattern, with .* added as a prefix and suffix, can be used with sub (gsub is not needed because we only need a single search-and-replace operation):

sa<-c("100 dollars for 200 pesos", "100 dollars 200 pesos")
sub(".*dollars\\D*(\\d+).*", "\\1", sa)
## => [1] "200" "200"

See yet another demo.

Answer 2

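Another approach is simply to use gsub() to get the number you want. More specifically, you can define a pattern that searches for the first number after the word 'dollars'.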

# define the pattern
pat <- "^.*dollars.*?([0-9]+).*"

# example 1
str <- "100 dollars for 200 pesos"
gsub(pat, "\\1", str)
## [1] "200"

# example 2
str <- " 100, actually 100.12 dollars for 200 pesos or 1000 dimes"
gsub(pat, "\\1", str)
## [1] "200"

To explain the pattern in more detail:

^        >> from the beginning of the string...
.*       >> every character till... 
dollars  >> the substring 'dollars'...
.*?      >> and then as few characters as possible until the first...
([0-9]+) >> number of any length, that is selected as group...
.*       >> and then everything else

When this pattern matches, gsub() replaces the whole string with the number captured as the group, that is, the first number after 'dollars'.
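When the pattern does not match, sub() and gsub() return the input string unchanged, which can be mistaken for a result. The sketch below shows one possible guard using grepl(). The vector x and the name res are made up for the example, and perl = TRUE is an added choice so that the lazy .*? is handled by the PCRE engine.

```r
# Same pattern as above
pat <- "^.*dollars.*?([0-9]+).*"

# Second element has no 'dollars', so the pattern cannot match it
x <- c("100 dollars for 200 pesos", "300 pesos")

# Unguarded: the non-matching string comes back unchanged
unguarded <- sub(pat, "\\1", x, perl = TRUE)

# Guarded: replace only where the pattern matches, otherwise return NA
res <- ifelse(grepl(pat, x, perl = TRUE),
              sub(pat, "\\1", x, perl = TRUE),
              NA_character_)
print(res)
```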


regex

r

lookaround

regex-lookarounds