R: remove specific words from a text, such as "the" and "this"
txt <- readLines("this.txt")
library(tm)
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))
It is hard to tell what your data looks like, but you can try gsub, a simple find-and-replace function.
gsub("The", "", "HelloThe")
which gives you
"Hello"
Are you familiar with regular expressions?
You can try reading here about the R function gsub.
Here is a small example of how it works:
> let <- c("A", "B", "A", "C") # My vector of letters
> let
[1] "A" "B" "A" "C"
> # I want to delete "A", so I replace this letter with nothing ("")
> l <- gsub("A", "", let) # replace "A" with "" in the vector let
> l
[1] "" "B" "" "C"
All you need to do now is remove the empty elements, if there are any.
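Dropping the empty strings that gsub leaves behind can be sketched in base R as follows. This is a minimal illustration that reuses the let and l vectors from the example above; nzchar() is just one of several ways to filter them out.

```r
# Vector of letters from the example above
let <- c("A", "B", "A", "C")

# Replace every "A" with the empty string
l <- gsub("A", "", let)

# Keep only the elements that still contain at least one character
l <- l[nzchar(l)]

print(l)
```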
If you have just a single string of characters, gsub works as well:
> let <- " a b c d g h a a a"
> let
[1] " a b c d g h a a a"
> l <- gsub("a", "", let)
> l
[1] " b c d g h "
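Note that a plain gsub call removes every matching substring, including matches inside longer words. If the goal is to remove whole words only, one possible approach is to wrap the words in \\b word-boundary anchors. The sketch below is an assumption about what the question is after; the sample sentences and the variable names are made up for illustration.

```r
# Hypothetical sample text; "their" and "theme" contain "the" but must survive
txt <- c("The cat sat on the mat", "this is their theme")

# \\b matches a word boundary, so only the whole words "the" and "this" match
pattern <- "\\b(the|this)\\b"

# ignore.case = TRUE also catches "The"; perl = TRUE selects PCRE syntax
cleaned <- gsub(pattern, "", txt, ignore.case = TRUE, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```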
I think you are asking how to use the tm library to remove words like 'the' and 'this'? If so, try this:
corpus <- tm_map(corpus, removeWords, stopwords("english"))
To remove specific words:
corpus <- tm_map(corpus, removeWords, c("hello","is","it","me","you're","looking","for?"))
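The two calls can also be combined by appending your own words to the built-in stop word list, and the result fed into the frequency table from the question. The following is only a sketch under assumptions: the two sample sentences and the extra word "fox" are invented for illustration, and VCorpus is used in place of Corpus so that content_transformer(tolower) applies cleanly.

```r
library(tm)

# Hypothetical sample documents standing in for readLines("this.txt")
docs <- c("The quick brown fox", "This is the lazy dog")
corpus <- VCorpus(VectorSource(docs))

# Lowercase first, because the built-in stop word list is lowercase
corpus <- tm_map(corpus, content_transformer(tolower))

# Built-in English stop words plus one custom word (illustrative only)
my_words <- c(stopwords("english"), "fox")
corpus <- tm_map(corpus, removeWords, my_words)

# Same frequency table as in the question, built from the cleaned corpus
tdm <- TermDocumentMatrix(corpus)
d <- data.frame(freq = sort(rowSums(as.matrix(tdm)), decreasing = TRUE))

print(rownames(d))
```

Lowercasing before removeWords matters here because the match is case-sensitive, so "The" would otherwise slip past the stop word list.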
Edit: I created an example using War and Peace, and it works. Try converting the terms to lowercase before creating the document-term matrix. Like this:
library(tm)
# load
txt <- readLines("this.txt")
corpus <- Corpus(VectorSource(txt))
# clean
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, PlainTextDocument)
# create dtm and get terms
dtm <- DocumentTermMatrix(corpus)
dtm$dimnames$Terms
Adapt the code to your text file; the output should look similar to this:
dtm$dimnames$Terms
[1] "almost" "anonymous" "anyone" "anywhere" "author" "away"
[7] "aylmer" "book" "chapter" "contents" "copy" "cost"
[13] "date" "david" "ebook" "english" "give" "gutenberg"
[19] "iii" "included" "january" "language" "last" "leo"
[25] "license" "louise" "march" "maude" "may" "one"
[31] "online" "peace" "posting" "project" "restrictions" "reuse"
[37] "start" "terms" "title" "tolstoy" "tolstoytolstoi" "translators"
[43] "updated" "use" "vii" "volunteer" "war" "whatsoever"
[49] "widger" "wwwgutenbergorg"