How to extract the values of interest from a string
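In one column of my data frame, the same object appears under several different names.
For example, suppose I am studying several cancers. Each cancer has several sub-specifications.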
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)", "Breast (ER- / PR- / HER2- / AR - / EGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)", "Breast (ER- / PR- / HER2- / PDL1 - / AR -)", "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))", "Breast (ER- / PR- / HER2-)", "Breast (ER- / PR- / HER2- / PD-L1 -)", "Breast (ER- / PR+ / HER2 -)")
dat <- as.data.frame(type)
dat
So we have:
Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)
Breast (ER- / PR- / HER2- / AR - / EGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)
Breast (ER- / PR- / HER2- / PDL1 - / AR -)
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2- / PD-L1 -)
Breast (ER- / PR+ / HER2 -)
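It may not look like it, but there are only two different types of cancer here: Breast (ER- / PR- / HER2-) and Breast (ER- / PR+ / HER2-).
Of course I have many more rows, this is only a sample, so I would like to write a function that lets me count how many types I have, looking only at the ER, PR and HER2 values.
To do this, I thought of writing a function that captures the string made up of Breast (ER\s + PR\s + HER2\s, where \s stands for any possible value (the three parts are matched separately because, as you can see, the three values do not always follow one another).
But I have not found a way to do this with gsub.
Edit:
In the end I would like to get another column that looks like this: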
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2-)
Breast (ER- / PR+ / HER2-)
This would allow me to count the types with the unique() function.
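The stringr package is your friend
stringr is a package, part of the tidyverse, that provides many useful wrappers that make working with strings more intuitive.
A few steps are needed here, so I will show you the output after each step.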
type <- c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)", "Breast (ER- / PR- / HER2- / AR - / EGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)", "Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)", "Breast (ER- / PR- / HER2- / PDL1 - / AR -)", "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))", "Breast (ER- / PR- / HER2-)", "Breast (ER- / PR- / HER2- / PD-L1 -)", "Breast (ER- / PR+ / HER2 -)")
dat <- as.data.frame(type)
# The str_*() functions come from stringr; mutate() and pull() come from dplyr
library(stringr)
library(dplyr)
## Get rid of all spaces
mutate(dat, simple_type = str_remove_all(type, "[:space:]")) |>
pull(simple_type) # just using pull to show you where we're up to with the process
#> [1] "Breast(ER-/PR-/EGFR-/AR-/PD-L1-/HER2-)"
#> [2] "Breast(ER-/PR-/HER2-/AR-/EGFR-)"
#> [3] "Breast(ER-/PR-/HER2-/BRCA-/PDL11%/FGFR-)"
#> [4] "Breast(ER-/PR-/HER2-/BRCA-/PDL12%)"
#> [5] "Breast(ER-/PR-/HER2-/PDL1-/AR-)"
#> [6] "Breast(ER-/PR-/HER2-/PD-L150%(BreastandIC5%liver))"
#> [7] "Breast(ER-/PR-/HER2-)"
#> [8] "Breast(ER-/PR-/HER2-/PD-L1-)"
#> [9] "Breast(ER-/PR+/HER2-)"
## Extract a list of the codes we're interested in
mutate(dat,
simple_type = str_remove_all(type, "[:space:]") |>
str_extract_all("(ER[+-])|(PR[+-])|(HER2[+-])")) |> ## extract all instances of 'ER' and one of +/-, OR PR and one of +/-, etc.
pull(simple_type)
#> [[1]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[2]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[3]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[4]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[5]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[6]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[7]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[8]]
#> [1] "ER-" "PR-" "HER2-"
#>
#> [[9]]
#> [1] "ER-" "PR+" "HER2-"
## Collapse each list element into a single string, then turn the list into a character vector
### (saving this new df as 'dat' because it makes the next step much easier to write)
dat <-
mutate(dat,
simple_type = str_remove_all(type, "[:space:]") |>
str_extract_all("(ER[+-])|(PR[+-])|(HER2[+-])") |>
lapply(str_c, collapse = " / ") |> # stringr::str_c() is pretty much identical to base::paste()
as.character())
dat[["simple_type"]]
#> [1] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [4] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR- / HER2-"
#> [7] "ER- / PR- / HER2-" "ER- / PR- / HER2-" "ER- / PR+ / HER2-"
## Paste back in the other stuff
dat <- mutate(dat, simple_type = str_c("Breast (", simple_type, ")"))
dat[["simple_type"]]
#> [1] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [3] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [5] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [7] "Breast (ER- / PR- / HER2-)" "Breast (ER- / PR- / HER2-)"
#> [9] "Breast (ER- / PR+ / HER2-)"
Created on 2022-05-27 by the reprex package (v2.0.1)
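For readers who prefer not to load the tidyverse, the same three steps (strip whitespace, extract the markers, paste them back together) can be sketched in base R. This is only an illustration under stated assumptions: `simplify_type()` and its `site` argument are hypothetical names that do not appear in the answer, and the sample vector is a shortened copy of the question's data.

```r
# Shortened copy of the question's data.
type <- c(
  "Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
  "Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",
  "Breast (ER- / PR- / HER2-)",
  "Breast (ER- / PR+ / HER2 -)"
)

# Hypothetical helper (not part of the original answer).
simplify_type <- function(x, site = "Breast") {
  # Step 1: remove all whitespace so "HER2 -" becomes "HER2-"
  no_space <- gsub("[[:space:]]", "", x)
  # Step 2: pull out every ER/PR/HER2 code together with its sign
  hits <- regmatches(no_space, gregexpr("HER2[+-]|ER[+-]|PR[+-]", no_space))
  # Step 3: collapse each set of codes into one string and add the prefix back
  markers <- vapply(hits, paste, character(1), collapse = " / ")
  paste0(site, " (", markers, ")")
}

simple_type <- simplify_type(type)
table(simple_type)            # rows per simplified type
length(unique(simple_type))   # number of distinct types
```

The codes are returned in the order in which they occur in each string, so two rows that list the same markers in a different order would still end up with different labels.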
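If I read this correctly, the cancer types can be told apart unambiguously by the +/- that follows "PR". So, would this be an option for you?
Note: the way you handle the pattern (accounting for whitespace and typos) may need to be more sophisticated. I am not good at regular expressions.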
library(stringr)
df$type <- as.factor(str_sub(df$cancer, 14, 17))
table(df$type)
#>
#> PR- PR+
#> 8 1
Created on 2022-05-27 by the reprex package (v2.0.1)
Data
df <- data.frame(cancer = c("Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
"Breast (ER- / PR- / HER2- / AR - / EGFR -)",
"Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)",
"Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)",
"Breast (ER- / PR- / HER2- / PDL1 - / AR -)",
"Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
"Breast (ER- / PR- / HER2-)",
"Breast (ER- / PR- / HER2- / PD-L1 -)",
"Breast (ER- / PR+ / HER2 -)")
)
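The fixed character positions passed to str_sub() only work while every string has exactly the same spacing before "PR". A slightly more tolerant variant, sketched here in base R, searches for "PR" followed by optional whitespace and a sign. The third sample string is a made-up variant with extra spaces, added only to show the difference; it is not part of the data above.

```r
cancer <- c(
  "Breast (ER- / PR- / HER2-)",
  "Breast (ER- / PR+ / HER2 -)",
  "Breast (ER - / PR + / HER2-)"  # hypothetical variant with extra spaces
)

# Locate "PR", optional whitespace, then "+" or "-" in each string
m <- regexpr("PR[[:space:]]*[+-]", cancer)

# Extract the matches and drop any whitespace between "PR" and the sign
pr_status <- gsub("[[:space:]]", "", regmatches(cancer, m))

table(pr_status)
```

Note that regmatches() silently drops strings with no match, so `pr_status` can be shorter than the input when some rows lack a PR value.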
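Here is a variation on the stringr approach. First, extract the ER, PR and HER2 components, including the sign (+/-). The line starting with across() removes any space between the text and the sign. The three components are then combined into a single column.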
dat <- read.delim(text = text)
library(tidyverse)
dat |>
  mutate(er = str_extract(type, "ER\\s*[+-]"),     # backslashes must be doubled in R strings
         pr = str_extract(type, "PR\\s*[+-]"),
         her2 = str_extract(type, "HER2\\s*[+-]"),
         across(c(er, pr, her2), ~ str_remove(., "\\s")),  # drop the space before the sign
         type1 = paste0("Breast (", er, " / ", pr, " / ", her2, ")")) |>
  select(-c(er, pr, her2))
# type type1
# Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / AR - / EGFR -) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / PDL1 - / AR -) Breast (ER- / PR- / HER2-)
#Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver)) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2-) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR- / HER2- / PD-L1 -) Breast (ER- / PR- / HER2-)
# Breast (ER- / PR+ / HER2 -) Breast (ER- / PR+ / HER2-)
Data:
text <- "type
Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)
Breast (ER- / PR- / HER2- / AR - / EGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 1% / FGFR -)
Breast (ER- / PR- / HER2- / BRCA- / PDL1 2%)
Breast (ER- / PR- / HER2- / PDL1 - / AR -)
Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))
Breast (ER- / PR- / HER2-)
Breast (ER- / PR- / HER2- / PD-L1 -)
Breast (ER- / PR+ / HER2 -)"
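Since the question asked specifically about gsub, here is a minimal base R sketch that does the whole rewrite in one sub() call with capture groups. It rests on one assumption that holds for the sample data but may not hold for the full data set: ER, PR and HER2 always appear in that order within a string. The sample vector is a shortened copy of the question's data.

```r
# Shortened copy of the question's data.
type <- c(
  "Breast (ER- / PR- / EGFR - / AR - / PD-L1 - / HER2-)",
  "Breast (ER- / PR- / HER2- / PD-L1 50% (Breast and IC 5% liver))",
  "Breast (ER- / PR+ / HER2 -)"
)

# Group 1: tumour site; groups 2-4: the signs of ER, PR and HER2.
# Assumption: the three markers occur in the order ER, PR, HER2.
pattern <- "^(\\w+)\\s*\\(.*?\\bER\\s*([+-]).*?\\bPR\\s*([+-]).*?\\bHER2\\s*([+-]).*$"

# Rebuild each string from the captured pieces only
simple_type <- sub(pattern, "\\1 (ER\\2 / PR\\3 / HER2\\4)", type, perl = TRUE)

simple_type
table(simple_type)
```

Strings that do not match the pattern are returned unchanged by sub(), so it is worth checking the result for leftovers before counting with unique() or table().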