How to extract substrings in R using stringr::str_match
I have the following two strings:
x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"
With this regex I can capture the parts of x without any problem:
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)")
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"
What I want is to get this for y:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"
Applying my current regex to y, I get:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.combined" "chr1" "625000" "635000" "BB_162" "combined"
How can I generalize my regex so that it handles both x and y?
Update
S.Kalbar, your regex gives this:
> stringr::str_match(y,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "combined" "HMSC-ad"
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" NA
What I want to get for y is this:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"
And this for x:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"
Regex: (\w+):(\d+)-(\d+)\.(\w+)(?:\.\w+)?(?:\.([A-Za-z-]+))
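As a quick check, the pattern above can be run against both strings with stringr::str_match. This is a minimal sketch, assuming the stringr package is installed; the variable names pattern and m are only illustrative and not part of the original answer. Note that each backslash has to be doubled inside an R string literal.

```r
library(stringr)

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

# Backslashes are doubled because this is an R string literal.
# The optional non-capturing group skips one middle token such as ".combined";
# the final group then captures the last dot-separated token.
pattern <- "(\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))"

# One row per input string: column 1 is the full match, columns 2-6 the groups.
m <- str_match(c(x, y), pattern)
print(m)
```

With this pattern the sixth column should hold "Adipose" for x and "HMSC-ad" for y, because the engine backtracks out of the optional group when only one token follows BB_162.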
Alternatively, you can give the engine some tokens to split on:
(?:(?<=\d)-(?=\d))|(?:\.combined\.)|[.:]+
Broken down, this says:
(?:(?<=\d)-(?=\d)) # a dash between numbers
| # or
(?:\.combined\.) # .combined. literally
| # or
[.:]+ # one or more of . or :
In R, using str_split():
library(stringr)
x <- c("chr1:625000-635000.BB_162.Adipose", "chr1:625000-635000.BB_162.combined.HMSC-ad")
str_split(x, '(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+', simplify = TRUE)
This yields
[,1] [,2] [,3] [,4] [,5]
[1,] "chr1" "625000" "635000" "BB_162" "Adipose"
[2,] "chr1" "625000" "635000" "BB_162" "HMSC-ad"
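If stringr is not available, the same split can be sketched in base R. This swaps str_split() for strsplit() with perl = TRUE (needed for the lookarounds), which is a substitution rather than what the answer above uses; the names delims, parts, and fields are illustrative.

```r
x <- c("chr1:625000-635000.BB_162.Adipose",
       "chr1:625000-635000.BB_162.combined.HMSC-ad")

# Same delimiter pattern as above; perl = TRUE enables lookbehind/lookahead.
delims <- "(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+"

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(x, delims, perl = TRUE)

# Stack the vectors into a matrix (assumes every string yields 5 fields).
fields <- do.call(rbind, parts)
print(fields)
```

Like the str_split() version, this hard-codes the literal token "combined", so other middle tokens would need to be added to the pattern.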