Extract only a numeric pattern between two underscores in a string

Question

I'm fairly new to regular expressions and have run into a dead end. I have a data frame with a column that looks like this:

year1
GMM14_2000_NGVA
GMM14_2001_NGVA
GMM14_2002_NGVA
...
GMM14_2014_NGVA

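I'm trying to extract the year in the middle of the string (2000, 2001, etc.). This is my code so far:

gsub("[^0-9]", "", year1)

This returns the digits, but it also returns the 14 that is part of the prefix: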

142000
142001

Any ideas on how to exclude the 14 from the pattern, or how to extract the year more efficiently?

Thanks

Answer 1

Use the following gsub:

s  = "GMM14_2002_NGVA"
gsub("^[^_]*_|_[^_]*$", "", s)

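See the IDEONE demo

Regex breakdown:

Match...

^[^_]*_ - 0 or more characters other than _ from the start of the string, followed by a _
| - or...
_[^_]*$ - a _ followed by 0 or more characters other than _, up to the end of the string

and remove them.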
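As a minimal sketch of how this behaves on a whole column, the same gsub call can be applied to a vector. The sample values below are assumptions standing in for the year1 column, and the last call illustrates a caveat: with more than two underscores, only the outermost pieces are stripped.

```r
# Hypothetical sample values standing in for the year1 column.
year1 <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA", "GMM14_2014_NGVA")

# Remove everything up to and including the first "_",
# and everything from the last "_" to the end of the string.
years <- gsub("^[^_]*_|_[^_]*$", "", year1)
print(as.integer(years))

# Caveat: with extra underscores, only the outermost pieces are removed.
extra <- gsub("^[^_]*_|_[^_]*$", "", "A_B_2000_C")
print(extra)  # "B_2000"
```

If the strings can contain additional underscores, a pattern that targets the digits directly is probably safer than one that strips the surroundings.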

As an alternative,

library(stringr)
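str_extract(s, "(?<=_)\\d{4}(?=_)")

where the Perl-like regex matches a 4-digit substring enclosed by underscores. The lookbehind (?<=_) and lookahead (?=_) require the underscores to be present without including them in the match.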
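To illustrate why the lookarounds matter, here is a small base R sketch that uses regexpr with perl = TRUE instead of stringr. The input string s2 is a made-up example (not from the question) whose prefix also contains four digits.

```r
# Hypothetical input whose prefix also contains a 4-digit run.
s2 <- "ID2014_2000_NGVA"

# Without lookarounds, the first 4-digit run wins.
plain <- regmatches(s2, regexpr("\\d{4}", s2, perl = TRUE))

# With lookarounds, only digits enclosed by underscores match.
bounded <- regmatches(s2, regexpr("(?<=_)\\d{4}(?=_)", s2, perl = TRUE))

print(plain)    # "2014"
print(bounded)  # "2000"
```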

Answer 2

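Here is one approach using the stringi package. It assumes the year always has 4 digits. Since you are looking for digits specifically, this is straightforward.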

library(stringi)

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

stri_extract_last(x, regex = "\\d{4}")
#[1] "2000" "2001"

or

stri_extract_first(x, regex = "\\d{4}")
#[1] "2000" "2001"

Answer 3

Another option in base R is strsplit, using @jazzurro's data:

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

vapply(strsplit(x, '_'), function(x) x[2], character(1))
[1] "2000" "2001"

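strsplit splits each element of the vector x on the underscore _ and returns a list of the same length as x. With vapply we collect the second element of each vector in that list, which is the year between the underscores.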
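The two steps described above can be sketched separately to make the intermediate list visible. The variable names parts and years are introduced here for illustration only, and fixed = TRUE is an optional choice that treats "_" as a literal string rather than a regex.

```r
x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

# Step 1: split on the literal underscore; the result is a list,
# one character vector of pieces per input string.
parts <- strsplit(x, "_", fixed = TRUE)
str(parts)

# Step 2: take the second piece of each vector. vapply checks that
# every result is a single character string.
years <- vapply(parts, `[`, character(1), 2)
print(years)
```

Note that this relies on the year always being the second piece, so it assumes every string has the same layout as the sample data.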

Answer 4

You can use sub.

sub(".*_(\\d{4})_.*", "\\1", x)

or

devtools::install_github("Avinash-Raj/dangas")
library(dangas)
extract_a("_", "_", x)

This extracts all the characters found between the start and end delimiters. Here both delimiters are the underscore.

Syntax:

extract_a(start, end, string)
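One point worth knowing about the sub approach at the top of this answer: when the pattern does not match, sub returns the input unchanged rather than signalling a miss. A minimal sketch of guarding against that, where the second element of x is a made-up non-matching value:

```r
# The second element is a hypothetical string without a 4-digit year.
x <- c("GMM14_2000_NGVA", "no_year_here")
pattern <- ".*_(\\d{4})_.*"

# sub leaves non-matching strings untouched.
raw <- sub(pattern, "\\1", x)
print(raw)

# Use grepl to turn non-matches into NA instead.
years <- ifelse(grepl(pattern, x), raw, NA_character_)
print(years)
```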

Answer 5

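I've never used R, but I have a lot of experience with regular expressions.

The idiomatic way is to match the part you want rather than delete the rest.

In R that should be regmatches: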

Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the vector returned by regexpr or gregexpr. If you pass the vector from regexpr then regmatches returns a character vector with all the strings that were matched. This vector may be shorter than the input vector if no match was found in some of the elements. If you pass the vector from gregexpr then regmatches returns a vector with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or NULL if an element had no matches.

> x <- c("abc", "def", "cba a", "aa")
> m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a"  "a"  "aa"
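For contrast, here is a short sketch of the gregexpr case on the same input. It suggests that the result is a list with one entry per input element, and that an element without matches yields an empty character vector, character(0). The name all_matches is introduced here for illustration.

```r
x <- c("abc", "def", "cba a", "aa")

# gregexpr finds every match in each element, not just the first.
m_all <- gregexpr("a+", x, perl = TRUE)
all_matches <- regmatches(x, m_all)

# One list entry per input string; "def" has no matches.
print(lengths(all_matches))  # 1 0 2 1
print(all_matches[[3]])      # "a" "a"
```

Because the regexpr form silently drops non-matching elements, the gregexpr form may be easier to align with the rows of a data frame.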

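In your case it should be:

m <- regexpr("\\d{4}", year1, perl=TRUE)
regmatches(year1, m)

If the same string can contain another run of 4 digits, you can use non-capturing groups to require the underscores. Probably something like this:

"(?:_)\\d{4}(?:_)"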

Sorry, I haven't had a chance to test any of this in R.
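Trying the non-capturing pattern suggested above in R shows one caveat: non-capturing groups still consume the underscores, so they end up in the matched text. The sketch below uses assumed sample values for year1 and strips the underscores afterwards. A lookaround pattern such as the one in Answer 1 avoids the extra step.

```r
# Hypothetical sample values standing in for the year1 column.
year1 <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

# The underscores are part of the match even though the groups
# are non-capturing.
m <- regexpr("(?:_)\\d{4}(?:_)", year1, perl = TRUE)
raw <- regmatches(year1, m)
print(raw)    # "_2000_" "_2001_"

# Strip the literal underscores to leave only the year.
years <- gsub("_", "", raw, fixed = TRUE)
print(years)  # "2000" "2001"
```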


regex

r

gsub