str_extract regex with quotes and semicolons

I am using R v4.0.0 and stringr (a wrapper around stringi) to parse long strings that contain semicolons and quotes. Here is an example string:

tstr1 <- 'gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; partial "true"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'

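I want to extract a quoted substring by first matching a variable pattern var and then extracting everything up to the next semicolon. I want to avoid matching instances of var that occur inside a quoted substring. So far, I have this: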

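library(stringr)  # str_extract() is provided by stringr, not stringi
library(dplyr)    # for the %>% pipe
var <- "partial"
# Match '"; ' + var, then everything up to the next semicolon
str_extract(string = tstr1, pattern = paste0('"; ', var, '[^;]+')) %>%
    gsub(paste0("\"; ", var), "", .) %>%   # drop the leading '"; ' and the key
    gsub("\"", "", .) %>% trimws()         # drop the quotes and surrounding spaces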

This returns "true", which is the output I want. However, I need a regex that also works for two edge cases:

Case 1

When var is at the start of the string, so I cannot rely on a preceding "; to anchor the match.

tstr2 <- 'partial "true"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'

Expected output: "true"

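Case 2

When the quoted substring to extract itself contains a semicolon, I want to match everything up to the next semicolon that is not inside a quoted substring.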

tstr3 <- 'partial "true; foo"; gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein motif:HMM:NF014037.1"; locus_tag "APE_RS08740"; note "incomplete; partial in the middle of a contig; missing N-terminus"; product "DUF5615 family PIN-like protein"; pseudo "true"; transl_table "11"; exon_number "1"'

Expected output: "true; foo"

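For the cases where 'partial' is not preceded by any ";, we can use an OR (|) condition in the pattern and then extract the characters between the double quotes.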

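library(stringr)
# Alternative 1: var preceded by '";' and whitespace, up to the next semicolon.
# Alternative 2: var at the start of the string, through the first semicolon
# and on to the next double quote.
str_extract(tstr, sprintf('";\\s+%1$s[^;]+|^%1$s[^;]+;[^"]+"', var)) %>%
    trimws(whitespace = '["; ]+', which = 'left') %>%
    str_extract('(?<=")[^"]+(?=")')   # keep only the text between the first pair of quotes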

-output

[1] "true"      "true"      "true; foo"
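
As a possible alternative (a sketch, not part of the answer above), the key can be anchored to either the start of the string or a closing `";`, and the value captured as everything between the next pair of double quotes. That way semicolons inside the value are kept even when the key is in the middle of the string. The helper name `extract_attr` and the shortened example strings are made up for illustration. The sketch assumes values never contain an escaped double quote and that `var` contains no regex metacharacters.

```r
library(stringr)

# Hypothetical helper: return the quoted value that follows the key `var`,
# or NA when the key is not found.
extract_attr <- function(x, var) {
  # The key must sit at the start of the string or right after a closing '";'.
  # The capture group takes everything up to the next double quote, so
  # semicolons inside the value are preserved.
  pattern <- sprintf('(?:^|";\\s+)%s\\s+"([^"]*)"', var)
  str_match(x, pattern)[, 2]
}

# Shortened, made-up strings in the same layout as tstr1 to tstr3
examples <- c(
  'gene_id "g1"; note "incomplete; partial in the middle"; partial "true"; pseudo "true"',
  'partial "true"; gene_id "g1"',
  'partial "true; foo"; gene_id "g1"',
  'gene_id "g1"; partial "true; foo"; pseudo "true"'
)
extract_attr(examples, "partial")
```

The word `partial` inside the `note` value is skipped because it is preceded by `e; ` rather than by `";`.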

data

tstr <- c(tstr1, tstr2, tstr3)
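
If several attributes are needed from the same string, it may be simpler to tokenise the whole string once into key/value pairs and then look values up by name. The following is a minimal base R sketch under the assumption that every value is double-quoted and contains no escaped quotes. `parse_attrs` is a hypothetical helper name, and the input string is a shortened, made-up example.

```r
# Hypothetical helper: split one attribute string into a named character vector.
parse_attrs <- function(x) {
  # Each token is a bare key, whitespace, then a double-quoted value.
  # A quoted value is consumed whole, so keys or semicolons inside it
  # are never treated as separate tokens.
  token <- '(\\w+)\\s+"([^"]*)"'
  pairs <- regmatches(x, gregexpr(token, x, perl = TRUE))[[1]]
  values <- sub(token, "\\2", pairs, perl = TRUE)
  names(values) <- sub(token, "\\1", pairs, perl = TRUE)
  values
}

attrs <- parse_attrs(
  'partial "true; foo"; note "incomplete; partial in the middle"; exon_number "1"'
)
attrs[["partial"]]
attrs[["note"]]
```

Because matching proceeds left to right without overlap, the `partial` inside the `note` value is part of that value and does not produce its own entry.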