How to remove parentheses with words inside using the tm package?
Suppose I have part of a text in a document like this:
"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
I want to remove "(API)", and this needs to be done before
corpus <- tm_map(corpus, removePunctuation)
After removing "(API)", it should look like this:
"Other segment comprised of our active pharmaceutical ingredient business,which..."
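I have searched for a long time but could only find answers about removing the parentheses themselves; I don't want the words inside them to appear in the corpus either.
I would really appreciate some hints.
If it's only single words, how about (untested):
removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})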
tm_map(corpus, removeBracketed)
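The single-word pattern above can be extended to multi-word parentheticals. The following is a minimal sketch, not a tested tm recipe: `strip_parens` is a hypothetical helper name, and the pattern assumes parentheses are not nested. It also removes the whitespace in front of the group so no double space is left behind. The tm part only runs if tm is installed.

```r
# Hypothetical helper (not part of tm): drop any "(...)" group together
# with the whitespace that precedes it. Assumes parentheses are not nested.
strip_parens <- function(x) gsub("\\s*\\([^)]*\\)", "", x)

txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
cleaned <- strip_parens(txt)
cleaned
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."

# With tm, the same function can be wrapped and applied BEFORE removePunctuation,
# because removePunctuation would otherwise strip the parentheses and keep "API".
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(txt))
  corpus <- tm::tm_map(corpus, tm::content_transformer(strip_parens))
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
}
```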
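You can use a smarter tokenizer, such as the one in the quanteda package, where removePunct = TRUE
will remove the parentheses automatically. Note that this strips only the parentheses; the word inside is kept as a token.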
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Added:
If you want to tokenize the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts(). But this works:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
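One caveat with the `-grep()` idiom: when nothing matches, `grep()` returns `integer(0)`, and indexing with `-integer(0)` drops every token. A minimal base R sketch of a safer variant follows; it uses `strsplit()` on whitespace as a stand-in for the "fasterword" tokenizer (an assumption made so the example runs without quanteda) and filters with `!grepl()`.

```r
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Whitespace split, used here as a stand-in for a whitespace-based tokenizer
toks <- strsplit(txt, "\\s+")[[1]]

# Keep every token that is NOT a fully parenthesized expression
kept <- toks[!grepl("^\\(.*\\)$", toks)]
kept
## [1] "Other"             "segment"           "comprised"
## [4] "of"                "our"               "active"
## [7] "pharmaceutical"    "ingredient"        "business,which..."

# No parenthetical present: the logical filter keeps all tokens,
# whereas negative indexing with an empty grep() result returns nothing
plain <- c("a", "b")
safe   <- plain[!grepl("^\\(.*\\)$", plain)]   # keeps "a" "b"
unsafe <- plain[-grep("^\\(.*\\)$", plain)]    # character(0)
```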
If you simply want to remove the parenthetical expressions as in the question, then you don't need either tm or quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
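The longer regular expression also catches cases where the parenthetical expression ends a sentence or is followed by additional punctuation such as a comma.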
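The pattern above requires a whitespace or punctuation character after the closing parenthesis, so a parenthetical at the very end of the string would be left in place. As a hedged sketch (a variation on the answer's regex, not something from the original answer), adding `$` as a third alternative covers that case; `perl = TRUE` is used here so the alternation with `$` is handled by PCRE.

```r
# Variant of the pattern: the trailing context may be whitespace,
# punctuation, or the end of the string. The trailing context is
# captured and put back, so commas and periods survive.
pattern <- "\\s\\(\\w*\\)(\\s|[[:punct:]]|$)"

examples <- c("ingredient (API) business",
              "ingredient (API), business",
              "ingredient (API)")

results <- gsub(pattern, "\\1", examples, perl = TRUE)
results
## [1] "ingredient business"  "ingredient, business" "ingredient"
```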