str_extract regex with quotes and semicolons
I am using R v4.0.0 and stringr (which is built on stringi) to parse long strings that contain semicolons and quotes. Here is an example string:
tstr1 <- 'gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; partial "true"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'
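I want to extract a quoted substring by first matching the pattern held in the variable var, then extracting everything up to the next semicolon. I want to avoid matching occurrences of var that sit inside a quoted substring. So far I have this: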
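library(stringr)  # str_extract() is a stringr function (stringr wraps stringi)
library(dplyr)    # provides the %>% pipe
var <- "partial"
# Match from the preceding '"; ' through var, up to the next semicolon
str_extract(string = tstr1, pattern = paste0('"; ', var, '[^;]+')) %>%
  gsub(paste0("\"; ", var), "", .) %>%  # drop the leading '"; partial'
  gsub("\"", "", .) %>% trimws()        # drop the quotes and outer whitespace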
This returns "true", which is the output I want. However, I need a regex that also works for two edge cases:
Case 1
When var is at the start of the string, so I cannot rely on a preceding "; to match.
tstr2 <- 'partial "true"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; infernce "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'
Expected output: "true"
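Case 2
When the quoted substring to extract itself contains a semicolon, I want to match everything up to the next semicolon that is not inside a quoted substring.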
tstr3 <- 'partial "true; foo"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; infernce "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'
Expected output: "true; foo"
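To make the two edge cases concrete, here is a minimal sketch that runs the pattern from the question against three shortened strings. The strings `mid`, `start` and `semi` are hypothetical stand-ins, not the original data, and `semi` places the key mid-string so that the preceding `"; ` is present.

```r
library(stringr)

var <- "partial"
pattern <- paste0('"; ', var, '[^;]+')

# Hypothetical shortened inputs (assumptions for illustration only)
mid   <- 'gene_id "A"; partial "true"; pseudo "true"'      # key mid-string
start <- 'partial "true"; gene_id "A"'                     # Case 1: key first
semi  <- 'gene_id "A"; partial "true; foo"; pseudo "true"' # Case 2: ';' in value

res <- str_extract(c(mid, start, semi), pattern)
print(res)
# mid:   the full '"; partial "true"' fragment is matched
# start: NA, because nothing precedes the key
# semi:  the match stops at the ';' inside the quotes, so the value is cut off
```

Only the first string yields a usable fragment, which is why a pattern that depends on `[^;]+` alone is not enough here.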
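For the case where 'partial' is not preceded by any " or ; (that is, it starts the string), we can use an OR (|) condition, and then extract the characters between the double quotes.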
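library(stringr)
# Alternative 1: '"; var ...' up to the next ';'. Alternative 2: var at the
# start of the string, through the double quote that follows the first ';'
str_extract(tstr, sprintf('";\\s+%1$s[^;]+|^%1$s[^;]+;[^"]+"', var)) %>%
  trimws(whitespace = '["; ]+', which = 'left') %>%  # strip leading '"; '
  str_extract('(?<=")[^"]+(?=")')                    # text between the quotes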
Output
[1] "true" "true" "true; foo"
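The pipeline above does three things in sequence, and it may help to see the intermediate values. The following is a sketch on two hypothetical short strings (not the original data), with the steps stored in illustrative variables `step1` to `step3`.

```r
library(stringr)

var <- "partial"
# Hypothetical inputs: key in the middle, and key at the start with a ';'
# inside its quoted value
x <- c('gene_id "A"; partial "true"; pseudo "true"',
       'partial "true; foo"; gene_id "A"')

# Step 1: grab a fragment that contains the key and its quoted value
step1 <- str_extract(x, sprintf('";\\s+%1$s[^;]+|^%1$s[^;]+;[^"]+"', var))
# Step 2: strip any leading quote, semicolon or space characters
step2 <- trimws(step1, whitespace = '["; ]+', which = "left")
# Step 3: keep only what sits between the first pair of double quotes
step3 <- str_extract(step2, '(?<=")[^"]+(?=")')

print(step1)
print(step2)
print(step3)
```

Note that the second alternative only applies at the start of the string, so a value containing a semicolon is handled only when the key comes first, which is the situation described in Case 2.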
Data
tstr <- c(tstr1, tstr2, tstr3)
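As a possible alternative, not part of the answer above, the whole string can be tokenised into key and quoted-value pairs and the value looked up by key. This is a sketch under the assumption that every attribute has the form key "value" and that values never contain an escaped double quote. The helper name `extract_attr` and the sample strings are hypothetical.

```r
library(stringr)

# Hypothetical helper: return the quoted value that belongs to `key`
extract_attr <- function(x, key) {
  # Each match is: key, whitespace, then a double-quoted value.
  # Quoted values are consumed whole, so a key name that appears
  # inside a value is never treated as a key.
  pairs <- str_match_all(x, '(\\w+)\\s+"([^"]*)"')
  vapply(pairs, function(m) {
    hit <- m[m[, 2] == key, 3]
    if (length(hit) == 0) NA_character_ else hit[[1]]
  }, character(1))
}

# Hypothetical shortened inputs covering the normal case and both edge cases
x <- c('gene_id "A"; note "incomplete; partial in the middle"; partial "true"',
       'partial "true"; gene_id "A"',
       'partial "true; foo"; gene_id "A"')

res <- extract_attr(x, "partial")
print(res)
```

Because the value is taken from a capture group, no follow-up trimming is needed, and the same helper works for other keys such as gene_id.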