How to extract substrings in R using stringr::str_match

I have the following two strings:

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

With this regex, I can capture the parts of x without any problem:

> stringr::str_match(x, "(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]     
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"

What I want to get from y is this:

     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad"  "chr1" "625000" "635000" "BB_162" "HMSC-ad"

Applying my current regex to y, I get:

     [,1]                                 [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined" "chr1" "625000" "635000" "BB_162" "combined"

How can I generalize my regex so that it handles both x and y?

Update

S.Kalbar, your regex gives this:

> stringr::str_match(y, "(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]       [,7]     
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "combined" "HMSC-ad"
> stringr::str_match(x, "(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]      [,7]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" NA 

What I want to get for y is this:

     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"

And for x:

     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" 

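Regex: (\w+):(\d+)-(\d+)\.(\w+)(?:\.\w+)?(?:\.([A-Za-z-]+))

The non-capturing (?:\.\w+)? optionally skips one middle segment such as .combined, so the last group always holds the final segment.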
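
In R, each backslash has to be doubled inside the string literal. A minimal sketch applying the pattern above to both strings from the question (it assumes stringr is installed; the variable name pattern is only for illustration):

```r
library(stringr)

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

# Same regex as above, with backslashes escaped for an R string literal.
# (?:\\.\\w+)? optionally skips one middle segment such as ".combined".
pattern <- "(\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))"

# One row per input string; column 1 is the full match, columns 2-6 the groups.
m <- str_match(c(x, y), pattern)
m[, 2:6]
```

Note that the last capture group only allows letters and hyphens, so a final segment containing digits or underscores would need a wider character class.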

Alternatively, you can give the regex engine a set of delimiters to split on:

(?:(?<=\d)-(?=\d))|(?:\.combined\.)|[.:]+

Broken down, this says:

(?:(?<=\d)-(?=\d))  # a dash between two digits
|                   # or
(?:\.combined\.)    # .combined. literally
|                   # or
[.:]+               # one or more of . or :


In R, using str_split() (note the doubled backslashes in the string literal):

library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose", "chr1:625000-635000.BB_162.combined.HMSC-ad")
str_split(x, '(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+', simplify = TRUE)

This yields

     [,1]   [,2]     [,3]     [,4]     [,5]     
[1,] "chr1" "625000" "635000" "BB_162" "Adipose"
[2,] "chr1" "625000" "635000" "BB_162" "HMSC-ad"
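
If the pieces are needed as named, typed columns, the split result can be turned into a data frame. A rough sketch only: the column names (chrom, start, end, sample, tissue) are my own guesses at what the fields mean and are not given in the question, and it assumes every input splits into exactly five pieces.

```r
library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose",
       "chr1:625000-635000.BB_162.combined.HMSC-ad")

parts <- str_split(x, "(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+",
                   simplify = TRUE)

# Hypothetical column names, chosen only for illustration.
colnames(parts) <- c("chrom", "start", "end", "sample", "tissue")

df <- as.data.frame(parts, stringsAsFactors = FALSE)
# The coordinates come back as character, so convert them explicitly.
df$start <- as.integer(df$start)
df$end   <- as.integer(df$end)
df
```

If some strings have more or fewer segments, the colnames assignment will fail, which is probably the desired signal that the pattern needs adjusting.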