Quotes and hyphens not removed by tm package functions while cleaning corpus
I am trying to clean up a corpus, and I have used the typical processing steps, as in the code below:
library(tm)         # corpus construction and transformations
library(SnowballC)  # stemming backend used by stemDocument

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)
However, when I inspect the matrix, there are several words that still carry quotes, for example:
"we"
"company"
“code
guidance”
-known
-accelerated
It seems the words themselves are inside the quotes, but when I try running removePunctuation again, it has no effect. There are also some words preceded by bullet points that I cannot remove either.
Any help would be greatly appreciated.
removePunctuation uses gsub('[[:punct:]]','',x), i.e. it removes these symbols: !"#$%&'()*+,\-./:;<=>?@[\\]^_`{|}~. Typographic (curly) quotes and bullet characters are not in this ASCII class, which is why they survive. To remove other symbols, like typographic quotes or bullet signs (or any other), declare your own transformation function:
removeSpecialChars <- function(x) gsub("[“”•]", "", x)  # character class: curly quotes and the bullet sign
docs <- tm_map(docs, removeSpecialChars)
Or you can go further and remove everything that is not an alphanumeric character or a space:
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, removeSpecialChars)
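A quick sanity check on a plain character string (the sample text here is made up for illustration):

x <- "“quoted words” • bullet -hyphen"
removeSpecialChars(x)
# [1] "quoted words  bullet hyphen"

The doubled spaces it leaves behind are collapsed later by stripWhitespace.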
A better tokenizer will handle this automatically. Try this:
> require(quanteda)
> text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
> toktexts <- tokenize(toLower(text), removePunct = TRUE, removeNumbers = TRUE)
> toktexts
[[1]]
[1] "enjoying" "my" "time"
[[2]]
[1] "single" "air" "quotes"
attr(,"class")
[1] "tokenizedTexts" "list"
> dfm(toktexts, stem = TRUE, ignoredFeatures = stopwords("english"), verbose = TRUE)
Creating a dfm from a tokenizedTexts object ...
... indexing 2 documents
... shaping tokens into data.table, found 6 total tokens
... stemming the tokens (english)
... ignoring 174 feature types, discarding 1 total features (16.7%)
... summing tokens by document
... indexing 5 feature types
... building sparse matrix
... created a 2 x 5 sparse dfm
... complete. Elapsed time: 0.016 seconds.
Document-feature matrix of: 2 documents, 5 features.
2 x 5 sparse Matrix of class "dfmSparse"
       features
docs    air enjoy quot singl time
  text1   0     1    0     0    1
  text2   1     0    1     1    0
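Note that this session uses an older quanteda API (tokenize(), toLower(), ignoredFeatures). In current quanteda releases the equivalent pipeline would look roughly like this (a sketch, not tied to a specific version):

library(quanteda)
text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
# tokens() drops punctuation, including typographic quotes and bullets
toks <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks, language = "english")
dfm(toks)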
@cyberj0g's answer needs a small modification for recent versions of tm (0.6). The updated code can be written like this:
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))
Thanks to @cyberj0g for the working code.
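For background: tm 0.6 stores each document as a PlainTextDocument object rather than a bare character vector, so passing a plain string function straight to tm_map strips that class; content_transformer() applies the function to the text content and rebuilds the document around it. A quick way to check the result (a sketch, assuming the corpus from the question):

# Inspect the first cleaned document; the typographic quotes,
# bullets and hyphens should be gone.
writeLines(as.character(corpus[[1]]))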