I am trying to create a DocumentTermMatrix while keeping all special characters
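I am trying to do some text mining in R without removing any special characters. For example, in the data below "LKC" and "LKC_" should be different words. Instead, tm strips the _ and treats them as the same word. How can I accomplish this?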
library(tm)
special <- c("OLAC_ LA LAC LAC_ LAC_E AC AC_ AC_E AC_ET",
             ")LK )LKC )LKC- LK LKC LKC-",
             "LAC_ LAC_E LKC LKC-")
bagOfWords <- Corpus(VectorSource(special))
mydocsDTM <- DocumentTermMatrix(bagOfWords,
                                control = list(removePunctuation = FALSE,
                                               preserve_intra_word_contractions = FALSE,
                                               preserve_intra_word_dashes = FALSE,
                                               removeNumbers = FALSE,
                                               stopwords = FALSE,
                                               stemming = FALSE))
inspect(mydocsDTM)
This is easy to do with the quanteda package. Afterwards you can convert the result to a DocumentTermMatrix, or simply keep working in quanteda.
library("quanteda")
qdfm <- dfm(special, tolower = FALSE, what = "fasterword")
qdfm
# Document-feature matrix of: 3 documents, 15 features (57.8% sparse).
# 3 x 15 sparse Matrix of class "dfm"
# features
# docs OLAC_ LA LAC LAC_ LAC_E AC AC_ AC_E AC_ET )LK )LKC )LKC- LK LKC LKC-
# text1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
# text2 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# text3 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
convert(qdfm, to = "tm")
# <<DocumentTermMatrix (documents: 3, terms: 15)>>
# Non-/sparse entries: 19/26
# Sparsity : 58%
# Maximal term length: 5
# Weighting : term frequency (tf)
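A note for newer quanteda versions: as far as I know, calling dfm() directly on a character vector was deprecated in quanteda 3 and later removed, so the call above may fail on a current install. A minimal sketch of the same idea, assuming quanteda 3 or later (and the tm package installed for the conversion step), is to tokenize first with tokens() and then build the matrix from the tokens:

```r
library("quanteda")

special <- c("OLAC_ LA LAC LAC_ LAC_E AC AC_ AC_E AC_ET",
             ")LK )LKC )LKC- LK LKC LKC-",
             "LAC_ LAC_E LKC LKC-")

# Split on whitespace only, so punctuation and symbols stay attached to each token
toks <- tokens(special, what = "fasterword")

# Keep the original case so that terms are not merged by lowercasing
qdfm <- dfm(toks, tolower = FALSE)

# Inspect the terms that were kept; "LAC" and "LAC_" should both be present
featnames(qdfm)

# Optional: convert to a tm DocumentTermMatrix (requires the tm package)
dtm <- convert(qdfm, to = "tm")
```

The whitespace-only tokenizer is the part that matters here: any tokenizer that treats punctuation as a word boundary will collapse "LAC_" into "LAC" before the matrix is ever built.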