Removing only numbers but keeping words like "3D" in R?
I have recently been doing text mining in R, but I am having trouble with data preprocessing.
I have a string like the one below:
"I want to buy 3D printer, but it costs 3000 dollars."
I want to keep the word "3D" but remove "3000", so the result should look like this:
"I want to buy 3D printer, but it costs dollars."
I used corpus <- tm_map(corpus, removeNumbers)
but this removes every digit in the text, so I end up with the term "D printer" in the result when it should be "3D printer".
Is there any way to solve this? Thanks!
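We can use gsub:
str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
# match a 3 followed by one or more digits and a whitespace character
gsub('3\\d+\\s', '', str1)
If this needs to be general,
# match any run of digits that starts at a word boundary and is followed by whitespace
gsub('\\b\\d+\\s', '', str1)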
#[1] "I want to buy 3D printer, but it costs dollars."
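The patterns above only match a number that is followed by whitespace, so a number sitting directly before punctuation (for example "3000." at the end of a sentence) would be left in place. Here is a minimal sketch of one way to cover that case in base R. The helper name remove_standalone_numbers and the second test string are hypothetical and not part of the original answer.

```r
# Hypothetical helper: drop runs of digits that form a whole word,
# leaving mixed tokens such as "3D" untouched.
remove_standalone_numbers <- function(x) {
  # \b on both sides requires a word boundary before and after the digits,
  # so the "3" in "3D" (followed by a letter) does not match.
  out <- gsub("\\b\\d+\\b", "", x, perl = TRUE)
  # Collapse the extra whitespace left behind by the removed numbers.
  out <- gsub("\\s+", " ", out)
  trimws(out)
}

str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
remove_standalone_numbers(str1)
# [1] "I want to buy 3D printer, but it costs dollars."

remove_standalone_numbers("A 3D printer costs 3000.")
# [1] "A 3D printer costs ."
```

With tm, a function like this could be applied to a corpus through tm_map(corpus, content_transformer(remove_standalone_numbers)). Note that this sketch treats a decimal such as "3.5" as two separate numbers and would leave the "." behind, so decimals would need a different pattern.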
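You could also use a text analysis package such as quanteda, which removes only tokens that are entirely numeric and leaves digits inside other tokens alone. So in your case: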
require(quanteda)
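# tokenize() with removeNumbers is the older quanteda API; later releases use tokens(x, remove_numbers = TRUE)
tokenize("I want to buy 3D printer, but it costs 3000 dollars.", removeNumbers = TRUE)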
## tokenizedText object from 1 document.
## Component 1 :
## [1] "I" "want" "to" "buy" "3D" "printer" "," "but" "it" "costs" "dollars" "."
If you want it returned as a single character object, without tokenization (although tokenizing might be your objective anyway), then:
paste(tokenize("I want to buy 3D printer, but it costs 3000 dollars.",
removeNumbers = TRUE, simplify = TRUE, removeSeparators = FALSE),
collapse = "")
## [1] "I want to buy 3D printer, but it costs dollars."