R: remove specific words from a text, e.g. "the" and "this"

Question

txt <- readLines("this.txt")

library(tm)

corpus <- Corpus(VectorSource(txt))

corpus <- tm_map(corpus, removePunctuation)

tdm <- TermDocumentMatrix(corpus)

m <- as.matrix(tdm)

d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

Answer 1

It's hard to tell what your data looks like, but you can try gsub, a simple find-and-replace function.

gsub("The", "", "HelloThe")

which gives you

"Hello"

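One caveat with a plain pattern is that gsub also removes it from inside longer words. A minimal sketch of whole-word removal follows, assuming a made-up sample sentence (`txt` here is hypothetical, not the contents of this.txt); the `\\b` markers restrict the match to word boundaries.

```r
# Hypothetical sample sentence, not taken from the question's file
txt <- "this is the thesis; this theme is there"

# Plain substring removal also strips "the" from inside longer words
naive <- gsub("the", "", txt)

# \\b anchors the match at word boundaries, so only whole words are removed
whole <- gsub("\\b(the|this)\\b", "", txt)

print(naive)
print(whole)
```

The removed words leave their surrounding spaces behind, so some whitespace cleanup is usually needed afterwards.
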
Answer 2

Do you know what regular expressions are? You can read about the R function gsub here. This is a small example of how it works:

> let <- c("A", "B", "A", "C") # My vector of letters
> let
[1] "A" "B" "A" "C"
> # I want to delete "A", so I replace this letter with nothing ("")
> l <- gsub("A", "", let) # replace "A" with "" in vector let
> l
[1] ""  "B" ""  "C"

All you have to do now is remove the empty elements (if there are any).
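
Dropping the empty elements can be sketched as below, reusing the `let` vector from the example; `l_clean` is just an illustrative name.

```r
let <- c("A", "B", "A", "C")
l <- gsub("A", "", let)

# Keep only the elements that are not the empty string;
# l[nzchar(l)] is an equivalent way to write the same filter
l_clean <- l[l != ""]
print(l_clean)
```
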

If you only have a single character string, gsub works as well:

> let <- " a b c d g h a a a"
> let
[1] " a b c d g h a a a"
> l <- gsub("a", "", let)
> l
[1] "  b c d g h   "
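
The result still contains the spaces that surrounded the removed letters. A small sketch of tidying them up with base R is shown here (for a tm corpus, `stripWhitespace` plays a similar role); `l_tidy` is an illustrative name.

```r
let <- " a b c d g h a a a"
l <- gsub("a", "", let)

# Collapse each run of whitespace into a single space,
# then drop leading and trailing whitespace
l_tidy <- trimws(gsub("\\s+", " ", l))
print(l_tidy)
```
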

Answer 3

I think you're asking how to remove words like 'the' and 'this' using the tm library? If so, try this:

corpus <- tm_map(corpus, removeWords, stopwords("english"))

To remove specific words:

corpus <- tm_map(corpus, removeWords, c("hello","is","it","me","you're","looking","for?"))
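
To get a feel for what whole-word removal does to each line, here is a rough base-R approximation that does not need tm. It is only a sketch: `remove_words` is a hypothetical helper and the sample lines are made up. It joins the words into one regular expression, so entries containing regex metacharacters (such as `?`) would be interpreted as patterns rather than literal text.

```r
# Hypothetical helper: delete whole-word occurrences of `words` from `x`
remove_words <- function(x, words) {
  # Build one alternation pattern, bounded by word boundaries on both sides
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Made-up sample lines, lower-cased first so that "This" and "this" match alike
lines <- c("Hello is it me you are looking for", "This is the theme")
cleaned <- remove_words(tolower(lines),
                        c("hello", "is", "it", "me", "the", "this"))
print(cleaned)
```

Longer words that merely contain a listed word, such as "theme", are left alone.
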

Edit: I created an example using War and Peace, and it works. Try converting the terms to lowercase before creating the document-term matrix, like this:

library(tm)
# load
txt <- readLines("this.txt")
corpus <- Corpus(VectorSource(txt))
# clean
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# content_transformer keeps the corpus structure intact, so no PlainTextDocument step is needed
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# create dtm and get terms
dtm <- DocumentTermMatrix(corpus)
dtm$dimnames$Terms

Change the code to suit your text file; the output should look similar to this:

dtm$dimnames$Terms
 [1] "almost" "anonymous" "anyone" "anywhere" "author" "away"
 [7] "aylmer" "book" "chapter" "contents" "copy" "cost"
[13] "date" "david" "ebook" "english" "give" "gutenberg"
[19] "iii" "included" "january" "language" "last" "leo"
[25] "license" "louise" "march" "maude" "may" "one"
[31] "online" "peace" "posting" "project" "restrictions" "reuse"
[37] "start" "terms" "title" "tolstoy" "tolstoytolstoi" "translators"
[43] "updated" "use" "vii" "volunteer" "war" "whatsoever"
[49] "widger" "wwwgutenbergorg"
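
To check that the unwanted words are gone from a frequency table like the one in the question, the counting step can be sketched in base R without tm. The sample lines and the word list below are made up for illustration, and splitting on non-letters is a simplification of what tm's tokenizer does.

```r
# Made-up sample lines standing in for the text file
lines <- c("This is the war", "and this is the peace", "war and peace and war")

# Lower-case, split on anything that is not a letter, drop empty tokens
words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
words <- words[nzchar(words)]

# Drop the unwanted words before counting
words <- words[!words %in% c("the", "this", "is", "and")]

# Frequency table, most frequent word first
freq <- sort(table(words), decreasing = TRUE)
d <- data.frame(freq = as.vector(freq), row.names = names(freq))
print(d)
```
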

r

tm