stringr: extract words containing a specific word

Consider this simple example

library(dplyr)
library(stringr)

dataframe <- tibble(text = c('WAFF;WOFF;WIFF200;WIFF12',
                             'WUFF;WEFF;WIFF2;BIGWIFF'))

> dataframe
# A tibble: 2 x 1
                      text
                     <chr>
1 WAFF;WOFF;WIFF200;WIFF12
2  WUFF;WEFF;WIFF2;BIGWIFF

Here I want to extract the words containing WIFF, that is, I want to end up with a data frame like this

> output
# A tibble: 2 x 1
            text
           <chr>
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

I tried using

dataframe %>% 
  mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))

but this only returns NA. Any ideas?

Thanks!

You seem to want to remove all the words that do not contain WIFF, together with the trailing ; if there is one. Use

> library(stringr)
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

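The pattern (?i)\b(?!\w*WIFF)\w+;? (in an R string literal each backslash must be doubled) matches:

  • (?i) - a case-insensitive inline modifier
  • \b - a word boundary
  • (?!\w*WIFF) - a negative lookahead that fails the match if the word starting here contains WIFF
  • \w+ - 1 or more word characters
  • ;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)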
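
To see the pieces working together, here is a minimal sketch that applies the pattern to a single string. The sample value x is made up for illustration (it is not from the question), and the backslashes are doubled because the pattern is written as an R string literal.

```r
library(stringr)

# Hypothetical sample input, chosen to include a lower-case match
x <- "WAFF;wiff7;WOFF;BIGWIFF"

# The negative lookahead rejects any word that contains WIFF
# (case-insensitively), so only the other words, plus an optional
# trailing ;, are replaced with the empty string
kept <- str_replace_all(x, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
kept
#[1] "wiff7;BIGWIFF"
```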

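If for some reason you want to use str_extract, note that your regex cannot work because \bWIFF\b matches a whole word WIFF and nothing else. You do not have such words in your DF. (Also, in an R string literal the backslash itself must be escaped, i.e. written as "\\b".) You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any word containing WIFF (case-insensitively), use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":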

> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12" 

[[2]]
[1] "WIFF2"   "BIGWIFF"

> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

You can "shrink" the code by putting str_extract_all inside the sapply call; I kept them separate for better visibility.
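
A minimal sketch of that shortened form might look like the following. It assumes the same sample data as above; stringsAsFactors = FALSE is added here only so the column is a plain character vector on older R versions.

```r
library(stringr)

df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'),
                 stringsAsFactors = FALSE)

# Extract every word containing WIFF and join the matches of each row
# with ; in a single step (extra arguments are passed on to paste)
df$text <- sapply(str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b"),
                  paste, collapse = ";")
df$text
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
```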

A classic non-regex approach in base R would be,

# split each string on ; and keep only the pieces that contain WIFF
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i) 
                              paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))

#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
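
Note that fixed = TRUE makes this match case-sensitive, unlike the (?i) patterns above. If case-insensitive matching is wanted in base R as well, one possible variant is to drop fixed = TRUE from grep and pass ignore.case = TRUE instead. The vector x below is a made-up example, not data from the question.

```r
# Hypothetical input with mixed-case matches
x <- c("WAFF;wiff200;WIFF12", "WUFF;BigWiff")

# Split on a literal ;, then keep the pieces matching WIFF in any case
res <- sapply(strsplit(x, ";", fixed = TRUE), function(words)
  paste(grep("WIFF", words, value = TRUE, ignore.case = TRUE), collapse = ";"))
res
#[1] "wiff200;WIFF12" "BigWiff"
```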