Extract characters of single word following :

Question

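I would like to extract the drug names, where "Drug:", "Other:", etc. precede the name of the drug. Take the first word after every ":", including characters like "-". If there are 2 instances of ":", then "and" should join the two words into one string. The output should be in a one-column data frame with the column name Drugs.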

Here is my reproducible example:

my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))

The output should look like this:

output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))

Here is what I have tried, without success. Try 1:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)')

Try 2:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')

Answer 1

I am not very familiar with R, but a pattern that gives you the matches from the example data could be:

(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*

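You can then join the matches with " and " in between.

The pattern matches:

(?<=:\s) Positive lookbehind, asserts a : followed by a whitespace character directly to the left
\w+(?:-\w+)* Matches 1+ word characters, then optionally repeats - and 1+ word characters
(?: Non-capturing group
   and \w+(?:-\w+)* Matches the literal " and " (spaces included) followed by 1+ word characters, then optionally repeats - and 1+ word characters
)* Closes the non-capturing group and optionally repeats it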

Regex demo

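To get all the matches, you can use str_extract_all (inside an R string literal, each backslash of the pattern is doubled):

str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')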

For example

library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
# Extract every match per row, then join each row's matches with " and "
lapply(
  str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*'),
  paste, collapse = " and ")

Output

[[1]]
[1] "TLD-1433"

[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"

[[3]]
[1] "Atezolizumab"

[[4]]
[1] "N-803 and BCG and N-803"

[[5]]
[1] "Everolimus and Intravesical"

[[6]]
[1] "Association and Association"
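
The question asks for a one-column data frame rather than a list. A minimal sketch of that last step, assuming the same pattern and a shortened, two-row copy of the question's data (the names `col1`, `pattern`, and `drugs` are only illustrative):

```r
library(stringr)

# Hypothetical two-row input, shortened from the question's data
col1 <- c("Product: TLD-1433 infusion Therapy",
          "Biological: CG0070|Other: n-dodecyl-B-D-maltoside")

# Same pattern as above; backslashes are doubled because this is an R string
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# One character vector of matches per row, collapsed to one string per row
drugs <- vapply(str_extract_all(col1, pattern),
                paste, character(1), collapse = " and ")

output.df <- data.frame(Drugs = drugs)
output.df
```

Using `vapply` with `character(1)` instead of `lapply` returns a plain character vector, which `data.frame` accepts directly as a column.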

Answer 2

Use

:\s*\b([\w-]+\b(?:\s+and\s+\b[\w-]+)*)\b

See the regex proof.

Explanation

--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [\w-]+                   any character of: word characters (a-z,
                             A-Z, 0-9, _), '-' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
--------------------------------------------------------------------------------
      [\w-]+                   any character of: word characters (a-
                               z, A-Z, 0-9, _), '-' (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
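
To see what the capture group contributes, here is a minimal sketch on one of the question's strings (the names `x` and `m` are only illustrative). `str_match_all` returns one matrix per input string, with the full match in the first column and capture group 1 in the second:

```r
library(stringr)

x <- "Drug: N-803 and BCG|Drug: N-803"

# [[1]] picks the match matrix for the single input string
m <- str_match_all(x, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")[[1]]

m[, 1]  # full matches, which still include the leading ":" and whitespace
m[, 2]  # capture group 1: only the drug names
```

This is why the code in this answer drops the first column before pasting the pieces together.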

R code:

my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
library(stringr)
matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
output.df <- data.frame(Drugs) 
output.df

Results:

                               Drugs
1                           TLD-1433
2 CG0070 and n-dodecyl-B-D-maltoside
3                       Atezolizumab
4            N-803 and BCG and N-803
5        Everolimus and Intravesical
6        Association and Association
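
Row 4 here reads "N-803 and BCG and N-803", whereas the question's expected output lists "N-803 and N-803". If the stated rule (only the first word after each ":") is taken literally, a shorter pattern that does not consume a following " and ..." could be used instead. This is a sketch under that reading, not part of the answer above; note that it also turns row 5 into just "Everolimus", which differs from the question's expected "Everolimus and Intravesical":

```r
library(stringr)

# Two rows taken from the question's data
col1 <- c("Drug: N-803 and BCG|Drug: N-803",
          "Drug: Everolimus and Intravesical Gemcitabine")

# Only the first run of word characters and hyphens after each ": "
first_words <- str_extract_all(col1, "(?<=:\\s)[\\w-]+")

Drugs <- vapply(first_words, paste, character(1), collapse = " and ")
data.frame(Drugs)
```

Which of the two readings is wanted depends on how "Drug: A and B" entries should be treated, which the question leaves open.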


regex

r

regex-lookarounds