R中的文档术语矩阵
Document term matrix in R
我有以下代码:
rm(list=ls(all=TRUE)) #clear data
setwd("~/UCSB/14 Win 15/Issy/text.fwt") #set working directory
files <- list.files(); head(files) #load & check working directory
fw1 <- scan(what="c", sep="\n",file="fw_chp01.fwt")
library(tm)
corpus2<-Corpus(VectorSource(c(fw1)))
skipWords<-(function(x) removeWords(x, stopwords("english")))
#remove punc, numbers, stopwords, etc
funcs<-list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc<-tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(1,110))) #create document term matrix
我正在尝试使用 tm 参考手册 (http://cran.r-project.org/web/packages/tm/tm.pdf) 中详述的一些操作,但收效甚微。例如,当我尝试使用 findFreqTerms 时,出现以下错误:
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
谁能告诉我为什么这不起作用以及我可以做些什么来解决它?
为@lawyeR 编辑:
head(fw1) 生成文本的前六行(James Joyce 的 Finnegans Wake 第 1 集):
[1] "003.01 riverrun, past Eve and Adam's, from swerve of shore to bend"
[2] "003.02 of bay, brings us by a commodius vicus of recirculation back to"
[3] "003.03 Howth Castle and Environs."
[4] "003.04 Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
[5] "003.05 core rearrived from North Armorica on this side the scraggy"
[6] "003.06 isthmus of Europe Minor to wielderfight his penisolate war: nor"
inspect(corpus2) 按以下格式输出文本的每一行(这是文本的最后一行):
[[960]]
<<PlainTextDocument (metadata: 7)>>
029.36 borough. #this part differs by line of course
inspect(corpus2a.dtm) returns所有类型的table(一共4163个(文中格式如下:
Docs youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
这是您提供和执行的内容的简化形式,tm
完成了它的工作。可能是您的一个或多个清洁步骤导致了问题。
> library(tm)
> fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
+ of bay, brings us by a commodius vicus of recirculation back to
+ Howth Castle and Environs.
+ Sir Tristram, violer d'amores, fr'over the short sea, had passen-
+ core rearrived from North Armorica on this side the scraggy
+ isthmus of Europe Minor to wielderfight his penisolate war: nor")
>
> corpus<-Corpus(VectorSource(c(fw1)))
> inspect(corpus)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
riverrun, past Eve and Adam's, from swerve of shore to bend
of bay, brings us by a commodius vicus of recirculation back to
Howth Castle and Environs.
Sir Tristram, violer d'amores, fr'over the short sea, had passen-
core rearrived from North Armorica on this side the scraggy
isthmus of Europe Minor to wielderfight his penisolate war: nor
> dtm <- DocumentTermMatrix(corpus)
> findFreqTerms(dtm)
[1] "adam's," "and" "armorica" "back" "bay," "bend"
[7] "brings" "castle" "commodius" "core" "d'amores," "environs."
[13] "europe" "eve" "fr'over" "from" "had" "his"
[19] "howth" "isthmus" "minor" "nor" "north" "passen-"
[25] "past" "penisolate" "rearrived" "recirculation" "riverrun," "scraggy"
[31] "sea," "shore" "short" "side" "sir" "swerve"
[37] "the" "this" "tristram," "vicus" "violer" "war:"
[43] "wielderfight"
另外一点,我发现在开始时将一些其他补充包加载到 tm
很有用。
library(SnowballC); library(RWeka); library(rJava); library(RWekajars)
相对于你有些复杂的清理步骤,我通常是这样跋涉的(用你的文本向量替换 comments$comment):
comments$comment <- tolower(comments$comment)
comments$comment <- removeNumbers(comments$comment)
comments$comment <- stripWhitespace(comments$comment)
comments$comment <- str_replace_all(comments$comment, " ", " ")
# replace all double spaces internally with single space
# better to remove punctuation with str_ because the tm function doesn't insert a space
library(stringr)
comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")
comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
从另一张票中这应该有助于 tm 0.6.0 有一个错误,可以用这个声明解决。
corpus_clean <- tm_map( corp_stemmed, PlainTextDocument)
希望对您有所帮助。
我有以下代码:
rm(list=ls(all=TRUE)) #clear data
setwd("~/UCSB/14 Win 15/Issy/text.fwt") #set working directory
files <- list.files(); head(files) #load & check working directory
fw1 <- scan(what="c", sep="\n",file="fw_chp01.fwt")
library(tm)
corpus2<-Corpus(VectorSource(c(fw1)))
skipWords<-(function(x) removeWords(x, stopwords("english")))
#remove punc, numbers, stopwords, etc
funcs<-list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus2.proc<-tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(1,110))) #create document term matrix
我正在尝试使用 tm 参考手册 (http://cran.r-project.org/web/packages/tm/tm.pdf) 中详述的一些操作,但收效甚微。例如,当我尝试使用 findFreqTerms 时,出现以下错误:
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
谁能告诉我为什么这不起作用以及我可以做些什么来解决它?
为@lawyeR 编辑:
head(fw1) 生成文本的前六行(James Joyce 的 Finnegans Wake 第 1 集):
[1] "003.01 riverrun, past Eve and Adam's, from swerve of shore to bend"
[2] "003.02 of bay, brings us by a commodius vicus of recirculation back to"
[3] "003.03 Howth Castle and Environs."
[4] "003.04 Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
[5] "003.05 core rearrived from North Armorica on this side the scraggy"
[6] "003.06 isthmus of Europe Minor to wielderfight his penisolate war: nor"
inspect(corpus2) 按以下格式输出文本的每一行(这是文本的最后一行):
[[960]]
<<PlainTextDocument (metadata: 7)>>
029.36 borough. #this part differs by line of course
inspect(corpus2a.dtm) returns所有类型的table(一共4163个(文中格式如下:
Docs youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
这是您提供和执行的内容的简化形式,tm
完成了它的工作。可能是您的一个或多个清洁步骤导致了问题。
> library(tm)
> fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
+ of bay, brings us by a commodius vicus of recirculation back to
+ Howth Castle and Environs.
+ Sir Tristram, violer d'amores, fr'over the short sea, had passen-
+ core rearrived from North Armorica on this side the scraggy
+ isthmus of Europe Minor to wielderfight his penisolate war: nor")
>
> corpus<-Corpus(VectorSource(c(fw1)))
> inspect(corpus)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
riverrun, past Eve and Adam's, from swerve of shore to bend
of bay, brings us by a commodius vicus of recirculation back to
Howth Castle and Environs.
Sir Tristram, violer d'amores, fr'over the short sea, had passen-
core rearrived from North Armorica on this side the scraggy
isthmus of Europe Minor to wielderfight his penisolate war: nor
> dtm <- DocumentTermMatrix(corpus)
> findFreqTerms(dtm)
[1] "adam's," "and" "armorica" "back" "bay," "bend"
[7] "brings" "castle" "commodius" "core" "d'amores," "environs."
[13] "europe" "eve" "fr'over" "from" "had" "his"
[19] "howth" "isthmus" "minor" "nor" "north" "passen-"
[25] "past" "penisolate" "rearrived" "recirculation" "riverrun," "scraggy"
[31] "sea," "shore" "short" "side" "sir" "swerve"
[37] "the" "this" "tristram," "vicus" "violer" "war:"
[43] "wielderfight"
另外一点,我发现在开始时将一些其他补充包加载到 tm
很有用。
library(SnowballC); library(RWeka); library(rJava); library(RWekajars)
相对于你有些复杂的清理步骤,我通常是这样跋涉的(用你的文本向量替换 comments$comment):
comments$comment <- tolower(comments$comment)
comments$comment <- removeNumbers(comments$comment)
comments$comment <- stripWhitespace(comments$comment)
comments$comment <- str_replace_all(comments$comment, " ", " ")
# replace all double spaces internally with single space
# better to remove punctuation with str_ because the tm function doesn't insert a space
library(stringr)
comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")
comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
从另一张票中这应该有助于 tm 0.6.0 有一个错误,可以用这个声明解决。
corpus_clean <- tm_map( corp_stemmed, PlainTextDocument)
希望对您有所帮助。