stringr: extract words containing a specific word
Consider this simple example:
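library(dplyr)    # provides %>%, mutate() and tibble()
library(stringr)  # provides str_extract() and regex()
dataframe <- tibble(text = c('WAFF;WOFF;WIFF200;WIFF12',
                             'WUFF;WEFF;WIFF2;BIGWIFF'))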
> dataframe
# A tibble: 2 x 1
text
<chr>
1 WAFF;WOFF;WIFF200;WIFF12
2 WUFF;WEFF;WIFF2;BIGWIFF
Here I want to extract the words containing WIFF, that is, I want to end up with a data frame like this:
> output
# A tibble: 2 x 1
text
<chr>
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
I tried using
dataframe %>%
  mutate(mystring = str_extract(text, regex('\\bwiff\\b', ignore_case = TRUE)))
but this only returns NA. Any ideas?
Thanks!
You seem to want to remove all words that do not contain WIFF, together with a trailing ; if there is one. Use
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
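The pattern (?i)\b(?!\w*WIFF)\w+;? (each backslash is doubled inside an R string literal) matches:
(?i) - a case-insensitive inline modifier
\b - a word boundary
(?!\w*WIFF) - a negative lookahead that fails the match at the start of any word that contains WIFF
\w+ - 1 or more word characters
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)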
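To make the breakdown above concrete, here is a small sketch that lists the pieces the pattern matches (the pieces that get removed) and the result of removing them. The `edge` input is a hypothetical extra case, not from the question: when a non-WIFF word comes last, the ; in front of it is left behind, so an additional cleanup step may be needed.

```r
library(stringr)

# The pattern from the answer, with backslashes doubled for an R string literal
pattern <- "(?i)\\b(?!\\w*WIFF)\\w+;?"
x <- c("WAFF;WOFF;WIFF200;WIFF12", "WUFF;WEFF;WIFF2;BIGWIFF")

# Pieces matched by the pattern: words without WIFF, plus their trailing ;
removed <- str_extract_all(x, pattern)

# Deleting those pieces leaves only the words that contain WIFF
kept <- str_replace_all(x, pattern, "")

# Hypothetical input where the last word does not contain WIFF:
# the separator before it survives and has to be trimmed separately
edge <- str_replace_all("WIFF1;WAFF", pattern, "")
edge_clean <- str_remove(edge, ";$")
```

With the data from the question, `kept` should equal the desired output, while `edge` illustrates a limitation to watch for with other inputs.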
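If for some reason you want to use str_extract, note that your regex cannot work because \bWIFF\b matches the whole word WIFF and nothing else, and you do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any word containing WIFF (case-insensitively), use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":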
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You can "shrink" the code by placing str_extract_all inside the sapply call; I kept them separate for better visibility.
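As a rough sketch, the "shrunk" version might look like the following. It assumes the same `df` as above; `stringsAsFactors = FALSE` is added here only so the column is character on older R versions.

```r
library(stringr)

df <- data.frame(text = c("WAFF;WOFF;WIFF200;WIFF12", "WUFF;WEFF;WIFF2;BIGWIFF"),
                 stringsAsFactors = FALSE)

# Extract every word containing WIFF, then join each row's matches with ";"
df$text <- sapply(str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b"),
                  paste, collapse = ";")
```

Here `paste` is passed directly to `sapply`, with `collapse = ";"` forwarded as an extra argument, which avoids the anonymous function.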
The classic non-regex base R approach would be,
sapply(strsplit(dataframe$text, ';', fixed = TRUE), function(i)
  paste(grep('WIFF', i, value = TRUE, fixed = TRUE), collapse = ';'))
#[1] "WIFF200;WIFF12" "WIFF2;BIGWIFF"
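The split-and-filter idea above can be wrapped in a small helper. This is only a sketch: `keep_words` and its `ignore_case` argument are hypothetical names introduced for illustration. Note that matching with `fixed = TRUE` is case-sensitive, unlike the (?i) patterns, so the helper uppercases both sides when a case-insensitive match is requested.

```r
# Hypothetical helper: keep only the sep-delimited words that contain `word`
keep_words <- function(x, word, sep = ";", ignore_case = FALSE) {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    if (ignore_case) {
      hit <- grepl(toupper(word), toupper(p), fixed = TRUE)
    } else {
      hit <- grepl(word, p, fixed = TRUE)
    }
    # paste() on an empty selection yields "" when nothing matches
    paste(p[hit], collapse = sep)
  }, character(1))
}

txt <- c("WAFF;WOFF;WIFF200;WIFF12", "WUFF;WEFF;WIFF2;BIGWIFF")
result <- keep_words(txt, "WIFF")

# Mixed-case input only matches when ignore_case = TRUE
mixed <- keep_words("waff;wiff1;BigWiff", "WIFF", ignore_case = TRUE)
```

Using `vapply` with `character(1)` guarantees one string per input row, even when no word matches.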