tm_map(gsub...) fails to replace words
# Loading required libraries
# Set up logistics such as reading in data and setting up corpus
```{r}
# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"
# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")
# Truncate file names so it is only showing "FirstLast-Term"
prez.out=substr(speeches, 6, nchar(speeches)-4)
# Create a vector NA's equal to the length of the number of speeches
length.speeches=rep(NA, length(speeches))
# Create a corpus
ff.all<-Corpus(DirSource(folder.path))
```
# Clean the data
```{r}
# Use tm_map to collapse runs of white space to a single space, convert to lower case, and remove stop words, empty strings and punctuation.
ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))
# Problem line
ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom")
ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)
# tdm.all = a Term Document Matrix
tdm.all<-TermDocumentMatrix(ff.all)
```
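So I want to replace similar words with a single root word, for example replacing "free" with "freedom" in a text mining project.
I picked up this line from a YouTube tutorial: ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom").
Without this line, the code runs.
With it added, RStudio gives the error "Error: inherits(doc, "TextDocument") is not TRUE" when it reaches this line: tdm.all<-TermDocumentMatrix(ff.all)
I feel this should be a fairly simple problem, but I couldn't find a solution on Stack Overflow.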
Using tm's built-in crude data, I was able to fix your problem by wrapping gsub in a content_transformer call.
ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))
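In my experience, tm_map does something odd with the objects returned by custom functions. So although your original line runs without complaint, tm_map does not quite return a true "Corpus", and that is what triggers the error later in TermDocumentMatrix.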
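To make the difference concrete, here is a small sketch that compares the two calls on the crude data. It assumes crude is a VCorpus, as it is in the tm versions I know of; a corpus built with Corpus(DirSource(...)) may be a different corpus class in newer tm releases and could behave differently. The names bad and good are only for illustration.

```r
library(tm)
data(crude)  # built-in corpus of Reuters articles (assumed to be a VCorpus)

# Passing gsub directly: tm_map stores whatever gsub returns for each document.
bad <- tm_map(crude, gsub, pattern = "free", replacement = "freedom")

# Wrapping gsub in content_transformer: only the document content is modified,
# so each element is still a text document with its metadata.
good <- tm_map(crude, content_transformer(function(x) gsub("free", "freedom", x)))

# Compare what each corpus holds for its first document.
class(bad[[1]])
class(good[[1]])
inherits(bad[[1]], "TextDocument")
inherits(good[[1]], "TextDocument")
```

If the first inherits() check is not TRUE, that would line up with the inherits(doc, "TextDocument") error raised by TermDocumentMatrix.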
As a side note:
This line doesn't seem to do anything:
ff.all<-tm_map(ff.all, removeWords, character(0))
The same goes for the "" in
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))
My full example:
library(tm)
data(crude)
ff.all <- crude
ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))
ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))
ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)
# tdm.all = a Term Document Matrix
tdm.all<-TermDocumentMatrix(ff.all)
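One more thing that may be worth checking, which goes beyond what was asked: the pattern "free" also matches inside longer words, including "freedom" itself. The base R sketch below uses a made-up vector x to show the effect and one possible way around it, using word boundaries. Whether "freely" and similar words should also be mapped to "freedom" is your call, so treat the bounded pattern as an assumption about what you want.

```r
# Hypothetical sample strings, not taken from the speeches
x <- c("free speech", "freedom of the press", "freely given")

# Plain pattern: replaces every occurrence of "free", even inside other words
naive <- gsub("free", "freedom", x)

# \\b restricts the match to the standalone word "free"
bounded <- gsub("\\bfree\\b", "freedom", x, perl = TRUE)

print(naive)    # "freedom speech" "freedomdom of the press" "freedomly given"
print(bounded)  # "freedom speech" "freedom of the press" "freely given"
```

Inside the pipeline the same pattern can go into the content_transformer wrapper in place of "free".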