How to extract the values of interest in a string

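In one column of my data frame, the same object appears under several different names.

For example, suppose I am studying several cancers. Each cancer has several sub-specifications.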

type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)", "Breast (ER- / PR- / HER2- / AR - / EGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)", "Breast (ER- / PR- / HER2- / PDL1 - / AR -)", "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))", "Breast (ER- / PR- / HER2-)", "Breast (ER- / PR- / HER2- / PD-L1 -)", "Breast (ER- / PR+ / HER2 -)")

dat <- as.data.frame(type)
dat

So we have:

Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)                
Breast (ER- / PR- / HER2- / AR - / EGFR -)              
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)               
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)                
Breast (ER- / PR- / HER2- / PDL1 - / AR -)              
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))             
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2- / PD-L1 -)                
Breast (ER- / PR+ / HER2 -)

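It may not look like it, but there are only two different cancer types here: Breast (ER- / PR- / HER2-) and Breast (ER- / PR+ / HER2-).

Of course I have many more rows; this is only an extract. So I would like to write a function that lets me count how many types I have, i.e. one that looks only at the ER, PR and HER2 values.

To do this, I thought of writing a function that captures strings of the form Breast (ER\s+ PR\s+HER2\s, where \s stands for any possible value (the three are matched separately because, as you can see, they do not always follow one another).

But I could not find a way to do this with gsub.

Edit:

In the end I would like to get another column that looks like this: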

Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR- / HER2-)              
Breast (ER- / PR+ / HER2-)

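This would let me count the types with the unique() function.

The stringr package is your friend

stringr is a package, part of the tidyverse, that provides many useful wrappers to make working with strings more intuitive.

A few steps are needed here, so I will show you the output after each one.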

type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)", "Breast (ER- / PR- / HER2- / AR - / EGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)", "Breast (ER- / PR- / HER2- / PDL1 - / AR -)", "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))", "Breast (ER- / PR- / HER2-)", "Breast (ER- / PR- / HER2- / PD-L1 -)", "Breast (ER- / PR+ / HER2 -)")

dat <- as.data.frame(type)
library(dplyr)   # mutate(), pull()
library(stringr) # str_remove_all(), str_extract_all(), str_c()

## Get rid of all spaces
mutate(dat, simple_type = str_remove_all(type, "[:space:]")) |> 
  pull(simple_type) # just using pull to show you where we're up to with the process
#> [1] "Breast(ER-/PR-/EGFR-/AR-/PD-L1-/HER2-)"            
#> [2] "Breast(ER-/PR-/HER2-/AR-/EGFR-)"                   
#> [3] "Breast(ER-/PR-/HER2-/BRCA-/PDL11%/FGFR-)"          
#> [4] "Breast(ER-/PR-/HER2-/BRCA-/PDL12%)"                
#> [5] "Breast(ER-/PR-/HER2-/PDL1-/AR-)"                   
#> [6] "Breast(ER-/PR-/HER2-/PD-L150%(BreastandIC5%liver))"
#> [7] "Breast(ER-/PR-/HER2-)"                             
#> [8] "Breast(ER-/PR-/HER2-/PD-L1-)"                      
#> [9] "Breast(ER-/PR+/HER2-)"

## Extract a list of the codes we're interested in
mutate(dat,
       simple_type = str_remove_all(type, "[:space:]") |>
         str_extract_all("(ER[+-])|(PR[+-])|(HER2[+-])")) |>  ## extract all instances of 'ER' and one of +/-, OR PR and one of +/-, etc.
  pull(simple_type)
#> [[1]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[2]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[3]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[4]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[5]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[6]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[7]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[8]]
#> [1] "ER-"   "PR-"   "HER2-"
#> 
#> [[9]]
#> [1] "ER-"   "PR+"   "HER2-"

## Collapse each list element into a single string, then turn the list into a character vector
### (saving this new df as 'dat' because it makes the next step much easier to write)
dat <- 
  mutate(dat,
       simple_type = str_remove_all(type, "[:space:]") |>
         str_extract_all("(ER[+-])|(PR[+-])|(HER2[+-])") |>
         lapply(str_c, collapse = " / ") |> # stringr::str_c() is pretty much identical to base::paste()
         as.character())
dat[["simple_type"]]
#> [1] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [4] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [7] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR+ / HER2-"

## Paste back in the other stuff
dat <- mutate(dat, simple_type = str_c("Breast (", simple_type, ")"))
dat[["simple_type"]]
#> [1] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [3] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [5] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [7] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [9] "Breast (ER- / PR+ / HER2-)"

Created on 2022-05-27 by the reprex package (v2.0.1)
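Since the question asks for a function and a count, the steps above could be wrapped up roughly as follows. This is only a sketch: simplify_type() is a hypothetical name, not part of stringr, and it assumes the markers appear in the order ER, PR, HER2 in every string, as they do in the sample data.

```r
library(stringr)

# Hypothetical helper (not from any package): reduce a label to its
# ER / PR / HER2 statuses, in the order they appear in the string.
simplify_type <- function(x) {
  # drop all whitespace, then pull out each marker together with its sign
  codes <- str_extract_all(str_remove_all(x, "\\s"), "(ER|PR|HER2)[+-]")
  # collapse each set of matches into one string per input element
  statuses <- vapply(codes, paste, character(1), collapse = " / ")
  str_c("Breast (", statuses, ")")
}

# three rows taken from the sample data in the question
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR- / HER2- / PD-L1 -)",
          "Breast (ER- / PR+ / HER2 -)")

simple <- simplify_type(type)
unique(simple)  # the distinct types
table(simple)   # how many rows fall into each type
```

table() gives the count per type directly, while length(unique(simple)) gives the number of distinct types.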

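If I read this correctly, the cancer types can be told apart unambiguously by the +/- that follows "PR". So would this be an option for you?

N.B. The way you handle the pattern probably needs to be more sophisticated (to account for whitespace/typos). I am not good at regex.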

library(stringr)
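# characters 14-17 of each string hold " PR-" or " PR+" (with a leading space);
# this relies on every string starting with "Breast (ER- / "
df$type <- as.factor(str_sub(df$cancer, 14, 17))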
table(df$type)
#> 
#>  PR-  PR+ 
#>    8    1

Created on 2022-05-27 by the reprex package (v2.0.1)

Data

df <- data.frame(cancer = c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",                
                            "Breast (ER- / PR- / HER2- / AR - / EGFR -)",              
                            "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",               
                            "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)",                
                            "Breast (ER- / PR- / HER2- / PDL1 - / AR -)",             
                            "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",            
                            "Breast (ER- / PR- / HER2-)",            
                            "Breast (ER- / PR- / HER2- / PD-L1 -)",               
                            "Breast (ER- / PR+ / HER2 -)")
)
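If the spacing is not consistent, fixed character positions stop working. A position-independent variant might look like the sketch below; the second and third input strings are made up for illustration and do not come from the question's data.

```r
library(stringr)

# the 2nd and 3rd strings are invented variants with irregular spacing
cancer <- c("Breast (ER- / PR- / HER2-)",
            "Breast (ER - / PR + / HER2 -)",
            "Breast (ER-/PR-/HER2-)")

# find "PR", optional whitespace, then a sign; afterwards drop the whitespace
pr <- str_remove_all(str_extract(cancer, "PR\\s*[+-]"), "\\s")
pr
table(pr)
```

Strings without a PR marker would come back as NA here, which table() leaves out by default.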

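Here is a variation on the stringr approach. First, extract the ER, PR and HER2 components, including the sign (+/-). The line beginning with across() removes the space between the text and the sign. The three components are then combined into a single column.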

dat <- read.delim(text = text)

library(tidyverse)

dat |> 
  mutate(er = str_extract(type, "ER\\s*[+-]"),
         pr = str_extract(type, "PR\\s*[+-]"),
         her2 = str_extract(type, "HER2\\s*[+-]"),
         across(c(er, pr, her2), ~ str_remove(., "\\s")),
         type1 = paste0("Breast (", er, " / ", pr, " / ", her2, ")")) |> 
  select(-c(er, pr, her2))
 
#                                                             type                      type1
#           Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-) Breast (ER- / PR- / HER2-)
#                     Breast (ER- / PR- / HER2- / AR - / EGFR -) Breast (ER- / PR- / HER2-)
#          Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -) Breast (ER- / PR- / HER2-)
#                   Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%) Breast (ER- / PR- / HER2-)
#                     Breast (ER- / PR- / HER2- / PDL1 - / AR -) Breast (ER- / PR- / HER2-)
#Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver)) Breast (ER- / PR- / HER2-)
#                                     Breast (ER- / PR- / HER2-) Breast (ER- / PR- / HER2-)
#                           Breast (ER- / PR- / HER2- / PD-L1 -) Breast (ER- / PR- / HER2-)
#                                    Breast (ER- / PR+ / HER2 -) Breast (ER- / PR+ / HER2-)
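For anyone who prefers to stay in base R, as the question's mention of gsub suggests, the same idea can be sketched without the tidyverse. get_status() is a hypothetical helper name, and the sketch assumes every string contains all three markers, because regmatches() silently drops elements that do not match.

```r
# two rows taken from the sample data
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
          "Breast (ER- / PR+ / HER2 -)")

# Hypothetical helper: extract one marker plus its sign and
# remove any whitespace between them.
get_status <- function(x, marker) {
  hit <- regmatches(x, regexpr(paste0(marker, "\\s*[+-]"), x))
  gsub("\\s", "", hit)
}

type1 <- paste0("Breast (", get_status(type, "ER"), " / ",
                get_status(type, "PR"), " / ",
                get_status(type, "HER2"), ")")
type1
```

Extracting each marker separately keeps the output order fixed as ER, PR, HER2, wherever HER2 sits in the original string.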

Data:

text <- "type
Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)
Breast (ER- / PR- / HER2- / AR - / EGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)
Breast (ER- / PR- / HER2- / PDL1 - / AR -)
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2- / PD-L1 -)
Breast (ER- / PR+ / HER2 -)"