Extract the characters of the single word following ":"
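I want to extract drug names, where "Drug:", "Other:", etc. precede the drug name.
Take the first word after each ":", including characters such as "-".
If there are 2 instances of ":", then "and" should join the two words into one string. The output should be in a single-column data frame with the column named Drug.
Here is my reproducible example: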
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
The output should look like this:
output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-matose", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))
Here is what I tried, but it did not work.
Attempt 1:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
Attempt 2:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')
I'm not very familiar with R, but a pattern that could give you the matches from the example data might be:
(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
You can then join the matches with " and " in between.
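One practical note when moving this pattern into R: inside an ordinary string literal every backslash has to be doubled. The following is a minimal sketch, assuming R 4.0.0 or later for the raw-string form; the variable names are illustrative only, and base R's `regmatches`/`gregexpr` are used here instead of stringr just to keep the check dependency-free.

```r
# The same pattern written two ways.
pat_escaped <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"
pat_raw <- r"((?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*)"  # raw string, R >= 4.0.0

# Both literals produce the identical character string.
same <- identical(pat_escaped, pat_raw)
print(same)

# Quick check with base R (perl = TRUE is needed for the lookbehind).
x <- "Drug: N-803 and BCG|Drug: N-803"
found <- regmatches(x, gregexpr(pat_escaped, x, perl = TRUE))[[1]]
print(found)
```

A single-backslash literal such as '\s' is rejected by the R parser as an unrecognized escape, which is why the doubled form appears in the R calls.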
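The pattern matches:
(?<=:\s) Positive lookbehind, asserts a : followed by a whitespace character directly to the left
\w+(?:-\w+)* Match 1+ word characters, optionally followed by repetitions of - and 1+ word characters
(?: Non-capturing group
 and \w+(?:-\w+)* Match " and " followed by 1+ word characters, optionally followed by repetitions of - and 1+ word characters
)* Close the non-capturing group and optionally repeat it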
To get all the matches, you can use str_extract_all
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
For example
library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
# Extract every match per row, then join the matches of each row with " and "
lapply(
  str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*'),
  paste, collapse = " and ")
Output
[[1]]
[1] "TLD-1433"
[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"
[[3]]
[1] "Atezolizumab"
[[4]]
[1] "N-803 and BCG and N-803"
[[5]]
[1] "Everolimus and Intravesical"
[[6]]
[1] "Association and Association"
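Note that element 4 above keeps "and BCG", whereas the expected output in the question lists "N-803 and N-803" for that row (while still expecting "Everolimus and Intravesical" for row 5, so the two rows pull in different directions). If strictly the first word after each ":" is wanted, one possible variation is to drop the optional " and ..." group. This is only a sketch under that assumption: it uses base R rather than stringr, the names `first_word`, `hits`, and `first_only.df` are made up for illustration, and the trailing line break in the last sample string is omitted.

```r
# Sample data from the question.
col1 <- c("Product: TLD-1433 infusion Therapy",
          "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
          "Drug: Atezolizumab",
          "Drug: N-803 and BCG|Drug: N-803",
          "Drug: Everolimus and Intravesical Gemcitabine",
          "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT")

# Only the first (possibly hyphenated) word after ": "; no " and ..." continuation.
first_word <- "(?<=:\\s)\\w+(?:-\\w+)*"

# One character vector of matches per row (perl = TRUE enables the lookbehind).
hits <- regmatches(col1, gregexpr(first_word, col1, perl = TRUE))

# Join the matches of each row with " and " and wrap them in a data frame.
Drugs <- vapply(hits, paste, character(1), collapse = " and ")
first_only.df <- data.frame(Drugs)
print(first_only.df)
```

With this variation row 4 becomes "N-803 and N-803", but row 5 becomes just "Everolimus", so which pattern fits depends on which of the two expectations matters more.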
Use
:\s*\b([\w-]+\b(?:\s+and\s+\b[\w-]+)*)\b
Explanation
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[\w-]+ any character of: word characters (a-z,
A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
and 'and'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
[\w-]+ any character of: word characters (a-
z, A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
library(stringr)
# Backslashes must be doubled inside an R string literal
matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
output.df <- data.frame(Drugs)
output.df
Result:
Drugs
1 TLD-1433
2 CG0070 and n-dodecyl-B-D-maltoside
3 Atezolizumab
4 N-803 and BCG and N-803
5 Everolimus and Intravesical
6 Association and Association
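The two answers give the same result on the sample data, but they treat hyphens slightly differently: `\w+(?:-\w+)*` requires word characters between hyphens, while `[\w-]+` accepts any mix of word characters and hyphens. A small sketch of that difference, using a made-up input with a doubled hyphen and, for brevity, simplified versions of both patterns without the " and ..." part (base R functions, illustrative variable names):

```r
x <- "Drug: a--b"   # hypothetical input, not from the question's data

pat_a <- "(?<=:\\s)\\w+(?:-\\w+)*"   # hyphens must separate runs of word characters
pat_b <- ":\\s*\\b([\\w-]+)\\b"      # any mix of word characters and hyphens

# First match of pattern A.
res_a <- regmatches(x, regexpr(pat_a, x, perl = TRUE))

# Pattern B: element 1 is the full match, element 2 is capture group 1.
res_b <- regmatches(x, regexec(pat_b, x, perl = TRUE))[[1]][2]

print(res_a)
print(res_b)
```

Here the first pattern stops at "a", while the second captures "a--b". Which behaviour is preferable depends on what the real drug names look like.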