R: remove specific words from a text, such as "the" and "this"

txt <- readLines("this.txt")
library(tm)
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

It's hard to tell what your data looks like, but you could try gsub, a simple find-and-replace function.

gsub("The", "", "HelloThe")

which gives you

"Hello"

Do you know what regular expressions are? You can read about the R function gsub on its help page (?gsub). Here is a small example of how it works:

> let <- c("A", "B", "A", "C") # My vector of letters
> let
[1] "A" "B" "A" "C"
> # I want to delete "A", so I replace it with nothing ("")
> l <- gsub("A", "", let) # replace "A" with "" in the vector let
> l
[1] ""  "B" ""  "C"

All that's left to do is remove the empty elements, if there are any.
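Dropping the empty elements could look like the sketch below. It reuses the vector from the example above and keeps only the elements that are not the empty string; nzchar(l) would be an equivalent test.

```r
let <- c("A", "B", "A", "C")   # same vector of letters as above
l <- gsub("A", "", let)        # "A" becomes "", the other letters are untouched

# Keep only the non-empty elements
l <- l[l != ""]
l
```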

If you have just a single character string, gsub works as well:

> let <- " a b c d g h a a a"
> let
[1] " a b c d g h a a a"
> l <- gsub("a", "", let)
> l
[1] "  b c d g h   "
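One caveat with the plain patterns above: they also match inside longer words, so removing "the" would damage "theme" or "other". A minimal sketch of whole-word removal follows, assuming you want to drop "the" and "this". The sample sentence is made up for illustration and is not taken from your file.

```r
# Hypothetical sample sentence, not from this.txt
x <- "this is the theme of the other thesis"

# \\b marks a word boundary, so only the standalone words "the" and "this" match
y <- gsub("\\b(the|this)\\b", "", x, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends
y <- trimws(gsub("\\s+", " ", y))
y
```

Adding ignore.case = TRUE to the first gsub call would also catch "The" and "This".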

I think you're asking how to remove words like 'the' and 'this' using the tm library? If so, try this:

corpus <- tm_map(corpus, removeWords, stopwords("english"))

To remove specific words:

corpus <- tm_map(corpus, removeWords, c("hello","is","it","me","you're","looking","for?"))
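For the two words from the question, a self-contained sketch might look like this. The two sample lines are hypothetical stand-ins for the contents of this.txt, and VCorpus is used here only so the cleaned text is easy to pull back out as a character vector. Lowercasing comes first because removeWords matches case-sensitively.

```r
library(tm)

# Hypothetical sample lines standing in for readLines("this.txt")
txt <- c("This is the first line.",
         "The second line mentions this and that.")
corpus <- VCorpus(VectorSource(txt))

# Lowercase first, otherwise "This" and "The" would not match "this" and "the"
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c("the", "this"))
corpus <- tm_map(corpus, stripWhitespace)   # squeeze the gaps left by removed words

# Extract the cleaned text to inspect it
cleaned <- sapply(corpus, as.character)
cleaned
```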

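Edit: I created an example using War and Peace, and it works. Try converting the terms to lowercase before creating the document-term matrix. removeWords matches case-sensitively and the words in stopwords("english") are lowercase, so "The" would otherwise survive. Like this: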

library(tm)

# load
txt <- readLines("this.txt")
corpus <- Corpus(VectorSource(txt))

# clean
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
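# tm >= 0.6 needs base functions wrapped in content_transformer();
# with the wrapper, no PlainTextDocument conversion step is required
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))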

# create dtm and get terms
dtm <- DocumentTermMatrix(corpus)
dtm$dimnames$Terms

Adapt the code to your text file; the output should look similar to this:

dtm$dimnames$Terms
 [1] "almost"          "anonymous"       "anyone"          "anywhere"        "author"          "away"           
 [7] "aylmer"          "book"            "chapter"         "contents"        "copy"            "cost"           
[13] "date"            "david"           "ebook"           "english"         "give"            "gutenberg"      
[19] "iii"             "included"        "january"         "language"        "last"            "leo"            
[25] "license"         "louise"          "march"           "maude"           "may"             "one"            
[31] "online"          "peace"           "posting"         "project"         "restrictions"    "reuse"          
[37] "start"           "terms"           "title"           "tolstoy"         "tolstoytolstoi"  "translators"    
[43] "updated"         "use"             "vii"             "volunteer"       "war"             "whatsoever"     
[49] "widger"          "wwwgutenbergorg"