How to find all longest matches of a string from a vector

I have some strings, and I need to extract all matches from a vector using the rebus package.

s <- "Heart disease Heart diseases include Blood vessel disease, such as coronary artery disease Heart rhythm problems arrhythmias Heart defects you're born with congenital heart defects Heart valve disease Disease of the heart muscle Heart infection"

library(rebus)
library(stringi)
symp.rx <- or1(whole_word(c("Heart", "Heart rhythm", "Heart valve")))
stri_extract_all_regex(s, symp.rx)

Running the code above gives me "Heart" "Heart" "Heart" "Heart" "Heart"

What am I missing? I also need "Heart rhythm", "Heart valve", etc.

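Note: the text is actually a column of a large data frame, the vector (symp.rx) has more than 5,000 words, and I need the output as a simplified vector for each row of the data frame (in a second column).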

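Pass the longer patterns first, followed by the shorter ones. A regex alternation tries its alternatives from left to right and stops at the first one that matches, so if "Heart" comes first it wins before "Heart valve" is ever tried.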

library(rebus)

# Longest alternatives first; "Heart rhythm" is spelled as it appears in s
symp.rx <- or1(whole_word(c("Heart rhythm", "Heart valve", "Heart")))

stringi::stri_extract_all_regex(s, symp.rx)
#[[1]]
#[1] "Heart"        "Heart"        "Heart rhythm" "Heart"        "Heart valve"  "Heart"
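
With more than 5,000 terms, ordering them by hand is not practical. Here is a minimal sketch of one way to do it: sort the terms by length before building the pattern, then store the matches for each row in a list column. The `terms` vector and the `df` data frame are made-up stand-ins for the real data, and the sketch assumes the terms contain no regex special characters.

```r
library(rebus)
library(stringi)

# Hypothetical term list, deliberately given with the short term first.
terms <- c("Heart", "Heart rhythm", "Heart valve")

# Sort longest first so the alternation tries "Heart valve" before "Heart".
terms_sorted <- terms[order(nchar(terms), decreasing = TRUE)]
symp.rx <- or1(whole_word(terms_sorted))

# Hypothetical data frame with a text column.
df <- data.frame(
  text = c("Heart valve disease and Heart rhythm problems",
           "Heart infection",
           "No match here"),
  stringsAsFactors = FALSE
)

# One character vector of matches per row, stored in a second (list) column.
# omit_no_match = TRUE gives character(0) instead of NA for rows with no match.
df$matches <- stri_extract_all_regex(df$text, symp.rx, omit_no_match = TRUE)
```

Here `df$matches[[1]]` holds "Heart valve" and "Heart rhythm", `df$matches[[2]]` holds "Heart", and the third row gets an empty vector. Sorting by `nchar()` is enough because a term can only be shadowed by another term that is a shorter prefix of it.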