stringr: extract words containing a specific word

Question

Consider this simple example:

library(dplyr)
library(stringr)
dataframe <- tibble(text = c('WAFF;WOFF;WIFF200;WIFF12',
                             'WUFF;WEFF;WIFF2;BIGWIFF'))

> dataframe
# A tibble: 2 x 1
                      text
                     <chr>
1 WAFF;WOFF;WIFF200;WIFF12
2  WUFF;WEFF;WIFF2;BIGWIFF

Here I want to extract the words that contain WIFF, i.e. I want to end up with a data frame like this:

> output
# A tibble: 2 x 1
            text
           <chr>
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

I tried using

dataframe %>% 
  mutate( mystring = str_extract(text, regex('\bwiff\b', ignore_case=TRUE)))

but this only returns NA. Any ideas?

Thanks!

Answer 1

You seem to want to remove every word that does not contain WIFF, together with its trailing ; (if there is one). Use

> library(stringr)
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

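The pattern (?i)\b(?!\w*WIFF)\w+;? matches:

(?i) - the case-insensitive inline modifier
\b - a word boundary
(?!\w*WIFF) - a negative lookahead that fails the match if the word starting at this position contains WIFF
\w+ - 1 or more word characters
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)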
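
To see the lookahead at work on individual words, here is a minimal sketch. The vector `words` is made up for illustration and is not part of the question's data; the pattern is the one above without the optional trailing ;.

```r
library(stringr)

# Hypothetical sample words, chosen only to illustrate the pattern
words <- c("WAFF", "WIFF200", "BIGWIFF", "wiff2")

# Same pattern as above, minus the optional trailing ';'
pattern <- "(?i)\\b(?!\\w*WIFF)\\w+"

# TRUE means the word matches the pattern, i.e. it would be removed
removed <- str_detect(words, pattern)
print(setNames(removed, words))
```

Only WAFF should come back TRUE, since every other word contains WIFF in some letter case.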

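If for some reason you want to use str_extract, note that your regex cannot work because \bWIFF\b matches the whole word WIFF and nothing else, and you have no such words in your DF. (In an R string literal the backslash also has to be doubled: '\\b' is a word boundary, while '\b' is a backspace character.) You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any word containing WIFF (case-insensitively), use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single string: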

> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12" 

[[2]]
[1] "WIFF2"   "BIGWIFF"

> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
            text
1 WIFF200;WIFF12
2  WIFF2;BIGWIFF

You can shorten the code by putting the str_extract_all call inside sapply; I kept the two steps separate for readability.
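
For completeness, a sketch of the compact form inside a dplyr pipeline like the one in the question. It assumes dplyr and stringr are installed, and it swaps sapply for vapply so the result is always a character vector; that swap is my choice here, not something the code above requires.

```r
library(dplyr)
library(stringr)

dataframe <- tibble(text = c("WAFF;WOFF;WIFF200;WIFF12",
                             "WUFF;WEFF;WIFF2;BIGWIFF"))

result <- dataframe %>%
  mutate(text = vapply(
    # One character vector of matches per row
    str_extract_all(text, regex("\\b\\w*WIFF\\w*\\b", ignore_case = TRUE)),
    # Join each row's matches back into a single ';'-separated string
    paste, character(1), collapse = ";"
  ))

print(result)
```

A row with no matching word ends up as an empty string rather than NA.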

Answer 2

The classic non-regex approach in base R would be,

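# Split each string on ';', keep the pieces containing the literal text 'WIFF'
# (case-sensitive), then paste them back together
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
  paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))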

#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
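
If the match needs to be case-insensitive, as in the question's attempt, one possible variation is sketched below. The helper name `keep_matching` and the sample input are hypothetical. Note that `word` is treated as a regular expression here, because grepl ignores `ignore.case` when `fixed = TRUE`.

```r
# Hypothetical helper: keep only the sep-delimited pieces that contain `word`
keep_matching <- function(x, word, sep = ";") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  vapply(pieces, function(p) {
    # grepl() with ignore.case = TRUE matches regardless of letter case
    paste(p[grepl(word, p, ignore.case = TRUE)], collapse = sep)
  }, character(1))
}

# Made-up input for illustration
out <- keep_matching(c("WAFF;wiff1;Wiff2", "WUFF;WEFF"), "WIFF")
print(out)
```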
