How to remove parentheses together with the words inside them using the tm package?

Question

Suppose a document contains a passage like this:

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

I want to remove "(API)", and I need to do it before running

corpus <- tm_map(corpus, removePunctuation)

After "(API)" is removed, the text should look like this:

"Other segment comprised of our active pharmaceutical ingredient business,which..."

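I have searched for a long time but could only find answers that strip the parentheses themselves. I do not want the word inside them to end up in the corpus either.

I would really appreciate some hints.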

Answer 1

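If it is only ever a single word, how about this (untested):

# backslashes must be doubled inside an R string literal
removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
# assign the result back, otherwise the corpus is left unchanged
corpus <- tm_map(corpus, removeBracketed)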
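The single-word pattern from this answer can be checked in base R without building a corpus. The following is only a minimal sketch: `txt` is the sample sentence from the question, and the names `stripped` and `cleaned` are illustrative. It also shows that deleting only "(API)" leaves two adjacent spaces behind, which a second substitution collapses.

```r
# Sample sentence from the question
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Single-word pattern, with backslashes doubled for the R string literal
stripped <- gsub("\\(\\w+\\)", "", txt)

# The spaces on both sides of "(API)" are still there, so collapse runs of spaces
cleaned <- gsub(" {2,}", " ", stripped)

print(cleaned)
```

Inside a tm pipeline, stripWhitespace could presumably take the place of the second substitution.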

Answer 2

You could use a smarter tokenizer, such as the one in the quanteda package, where removePunct = TRUE removes the parentheses automatically.

quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
##  [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical"
##  [8] "ingredient"     "API"            "business"       "which"

Added:

If you want to tokenize the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:

# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
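# drop tokens that are entirely a parenthesized expression; grepl() is safe even when nothing matches
toks[!grepl("^\\(.*\\)$", toks)]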
## [1] "Other"             "segment"           "comprised"         "of"                "our"               "active"           
## [7] "pharmaceutical"    "ingredient"        "business,which..."
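The same token-level filtering can be sketched in base R, assuming whitespace splitting with strsplit() is an acceptable stand-in for the "fasterword" tokenizer. The variable names are illustrative, and `txt` is the sample sentence from the question.

```r
# Sample sentence from the question
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Split on runs of whitespace (an assumed stand-in for a whitespace tokenizer)
toks <- strsplit(txt, "\\s+")[[1]]

# Keep every token that is not wholly wrapped in parentheses
kept <- toks[!grepl("^\\(.*\\)$", toks)]

print(kept)
```

Logical indexing with grepl() is used instead of negative indexing with grep(), because a negative index built from an empty match would return no tokens at all.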

If you just want to remove the parenthesized expression as in the question, then you do not need tm or quanteda:

# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."

# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API).  New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient.  New sentence..."

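The longer regular expression also handles the cases where the parenthesized expression ends a sentence or is followed by additional punctuation, such as a comma.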
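If the parentheses may hold more than one word, a different pattern is needed, since \w* stops at the first space. The sketch below is not the pattern from this answer: it swaps in a character class that matches anything except parentheses, and `strip_parenthesized` is a hypothetical helper name. It assumes parentheses are never nested.

```r
# Hypothetical helper: remove "(...)" together with the whitespace in front of it.
# Assumes parentheses are not nested.
strip_parenthesized <- function(x) {
  gsub("\\s*\\([^()]*\\)", "", x)
}

single  <- strip_parenthesized("ingredient (API) business,which...")
comma   <- strip_parenthesized("ingredient (API), business,which...")
multi   <- strip_parenthesized("our unit (the API business) grew")

print(c(single, comma, multi))
```

Because only the whitespace before the opening parenthesis is consumed, whatever follows the closing parenthesis (a space, a comma, a period) is left in place.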
