Removing only numbers but keeping words like "3D" in R?

I have recently been doing text mining in R, but I am having trouble with data preprocessing. I have a string like the following:

"I want to buy 3D printer, but it costs 3000 dollars."

I want to keep the word "3D" but remove "3000", so the result should look like this:

"I want to buy 3D printer, but it costs dollars."

I used corpus <- tm_map(corpus, removeNumbers), but this removes every digit in the text, so I end up with the term "D printer" in the result when it should be "3D printer".

Is there any way to solve this? Thanks!

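We can use gsub. In R string literals the backslashes must be doubled:

str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
gsub('3\\d+\\s', '', str1)

If this needs to be general, match any run of digits that starts at a word boundary and is followed by whitespace:

gsub('\\b\\d+\\s', '', str1)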
#[1] "I want to buy 3D printer, but it costs dollars."
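
The pattern above only matches a number that is followed by whitespace, so a number directly before punctuation (for example "3000.") would be left in place. A minimal sketch of a slightly broader variant follows; the helper name remove_standalone_numbers is hypothetical and not part of the original answer, and it assumes that a "number" means a run of digits with a word boundary on both sides. Note that it would also strip the digits from a decimal such as "3.5".

```r
# Hypothetical helper (not from the original answer): drop standalone numbers
# while keeping digits that are part of a word such as "3D".
remove_standalone_numbers <- function(x) {
  # \\s*        optional whitespace before the number, removed together with it
  # \\b\\d+\\b  a run of digits with a word boundary on both sides
  gsub("\\s*\\b\\d+\\b", "", x, perl = TRUE)
}

str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
result <- remove_standalone_numbers(str1)
print(result)
#[1] "I want to buy 3D printer, but it costs dollars."
```

Since the question uses tm, the same function could probably be applied to a corpus with tm_map(corpus, content_transformer(remove_standalone_numbers)), assuming a tm version that provides content_transformer.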

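You can also use a text analysis package such as quanteda, which removes only tokens that are numbers, not digits that appear inside words. So in your case: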

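require(quanteda)
# tokenize() with removeNumbers is the API of older quanteda releases;
# newer releases provide tokens(x, remove_numbers = TRUE) instead.
tokenize("I want to buy 3D printer, but it costs 3000 dollars.", removeNumbers = TRUE)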
## tokenizedText object from 1 document.
## Component 1 :
## [1] "I"       "want"    "to"      "buy"     "3D"      "printer" ","       "but"     "it"      "costs"   "dollars" "."      

If you want it returned as a single character object, without tokenization (although tokenization is probably your objective), then:

paste(tokenize("I want to buy 3D printer, but it costs 3000 dollars.",
               removeNumbers = TRUE, simplify = TRUE, removeSeparators = FALSE), 
      collapse = "")
## [1] "I want to buy 3D printer, but it costs  dollars."