Getting different patterns within the same string

Question

I have the following data frame:

structure(list(Nombre.HIC.HGVS = c("NC_000007.13:g.87178626C>T", 
"NC_000007.13:g.87278760G>A", "NC_000012.11:g.22063115A>G"), 
    Gen = c("ABCB1", "ABCB1", "ABCB1"), Isoforma = c("NM_000927.4", 
    "NM_000927.4", "NM_000927.4"), Nombre.genómico = c("g.87178626C>T", 
    "g.87278760G>A", "g.87133993T>C"), Nombre.cDNA = c("c.1725+38G>A", 
    "c.-330-48366C>T", "c.3637-228A>G"), rs = c("rs2235013", 
    "rs10267099", "rs1186746"), Zona..Inicial..Final. = c("Intrón: 15 / -", 
    "Intrón: 1 / -", "Intrón: 28 / -"), Count.... = c("78,947%", 
    "10,526%", "7,895%"), Count..total. = c("(30 / 38)", "(4 / 38)", 
    "(3 / 38)"), Count.HiC.... = c("77,652%", "7,961%", "3,544%"
    ), Profundidad.de.lectura..DP. = c(310L, 406L, 27L), Profundidad.de.lectura.corregida.por.calidad..DP.QUAL. = c(403L, 
    283L, 20L), Frecuencia.del.alelo.alternativo.en.las.lecturas..FREQ.ALT. = c(50.62, 
    99.65, 40), Calidad.en.la.identificación.de.la.variante..Qual. = c(255L, 
    255L, 95L), Cigosidad..AF1. = c("Heterocigosis", "Homocigosis", 
    "Heterocigosis")), row.names = c(NA, 3L), class = "data.frame")

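I need to extract, into two separate columns, the numbers I have marked between []. For example:

NC_000007.13:g.87278760G>A ->> I want to get from here: NC_00000[7].13:g.[87278760]G>A
NC_000012.11:g.22063115A>G ->> I want to get from here: NC_0000[12].11:g.[22063115]A>G

So basically, I want to keep the last digit before the first dot (or the last two digits before the first dot, as long as the second-to-last one is not 0), and all the digits after "g.".

I have been using the stringr package, but these are too many conditions for me.

Any ideas?

Thanks!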

Answer 1

Here is a possible solution:

library(tidyverse)
df <- structure(list(
  Nombre.HIC.HGVS = c("NC_000007.13:g.87178626C>T", "NC_000007.13:g.87278760G>A",
                      "NC_000013.11:g.22063115A>G"),
  Gen = c("ABCB1", "ABCB1", "ABCB1"),
  Isoforma = c("NM_000927.4", "NM_000927.4", "NM_000927.4"),
  Nombre.genómico = c("g.87178626C>T", "g.87278760G>A", "g.87133993T>C"),
  Nombre.cDNA = c("c.1725+38G>A", "c.-330-48366C>T", "c.3637-228A>G"),
  rs = c("rs2235013", "rs10267099", "rs1186746"),
  Zona..Inicial..Final. = c("Intrón: 15 / -", "Intrón: 1 / -", "Intrón: 28 / -"),
  Count.... = c("78,947%", "10,526%", "7,895%"),
  Count..total. = c("(30 / 38)", "(4 / 38)", "(3 / 38)"),
  Count.HiC.... = c("77,652%", "7,961%", "3,544%"),
  Profundidad.de.lectura..DP. = c(310L, 406L, 27L),
  Profundidad.de.lectura.corregida.por.calidad..DP.QUAL. = c(403L, 283L, 20L),
  Frecuencia.del.alelo.alternativo.en.las.lecturas..FREQ.ALT. = c(50.62, 99.65, 40),
  Calidad.en.la.identificación.de.la.variante..Qual. = c(255L, 255L, 95L),
  Cigosidad..AF1. = c("Heterocigosis", "Homocigosis", "Heterocigosis")
), row.names = c(NA, 3L), class = "data.frame")
df %>%
  select(Nombre.HIC.HGVS) %>%
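  # first: the first run of digits (e.g. "000007") with leading zeros stripped
  mutate(first = gsub(x = str_extract(Nombre.HIC.HGVS, "\\d+"), pattern = "^0+", replacement = ""),
         # second: the digits preceded by "g." and followed by a letter
         second = str_extract(Nombre.HIC.HGVS, "(?<=g\\.)\\d+(?=[[:alpha:]]+)"))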
#>              Nombre.HIC.HGVS first   second
#> 1 NC_000007.13:g.87178626C>T     7 87178626
#> 2 NC_000007.13:g.87278760G>A     7 87278760
#> 3 NC_000013.11:g.22063115A>G    13 22063115

Created on 2022-03-03 by the reprex package (v2.0.1)
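
As a possible variation on the approach above, both numbers could be pulled out in one pass with `stringr::str_match()` and two capture groups. This is only a sketch: it assumes every string starts with `NC_` and contains `g.` followed by digits, and the conversion to integer is an added choice that is not part of the answer above. The vector `x` uses the three strings from the question.

```r
library(stringr)

# Example strings taken from the question's data frame
x <- c("NC_000007.13:g.87178626C>T",
       "NC_000007.13:g.87278760G>A",
       "NC_000012.11:g.22063115A>G")

# One pattern, two capture groups:
#   ^NC_0*(\d+)\.  digits before the first dot, leading zeros skipped
#   g\.(\d+)       digits immediately after "g."
m <- str_match(x, "^NC_0*(\\d+)\\..*g\\.(\\d+)")

# Column 1 of m is the full match; columns 2 and 3 are the capture groups
result <- data.frame(
  first  = as.integer(m[, 2]),
  second = as.integer(m[, 3])
)
result
```

Rows that do not fit the assumed layout come back as `NA`, which makes malformed identifiers easy to spot.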

regex

r

stringr