How to preserve the original form of words while text mining
I want to create a list of the words that appear at least twice on a specific web page.
I managed to fetch the data and get a count for each word, but
I need capitalized words to stay that way. Right now the code only produces a lowercase word list.
For example, the word "Miami" becomes "miami", but I need it as "Miami".
How can I keep the original form of the words?
Here is the code:
library(XML)
web_page <- htmlTreeParse("http://www.larryslist.com/artmarket/the-talks/dennis-scholls-multiple-roles-from-collecting-art-to-winning-emmy-awards/"
,useInternal = TRUE)
doctext = unlist(xpathApply(web_page, '//p', xmlValue))
doctext = gsub('\n', ' ', doctext)
doctext = paste(doctext, collapse = ' ')
library(tm)
SampCrps<- Corpus(VectorSource(doctext))
corp <- tm_map(SampCrps, PlainTextDocument)
oz <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = FALSE) # remove punctuation
oz <- tm_map(oz, removeWords, stopwords("english")) # remove stopwords (chain from oz, not corp, or the punctuation removal above is discarded)
dtm <-DocumentTermMatrix(oz)
findFreqTerms(dtm, 2) # words that appear at least 2 times
dtmMatrix <- as.matrix(dtm)
wordsFreq <- colSums(dtmMatrix)
wordsFreq <- sort(wordsFreq, decreasing=TRUE)
head(wordsFreq)
wordsFreq <- as.data.frame(wordsFreq)
wordsFreq <- data.frame(word = rownames(wordsFreq), count = wordsFreq, row.names = NULL)
head(wordsFreq,50)
The same problem occurs when I build three-word ngrams with this code:
library(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3)) # min = max = 3, so these are trigrams
tdm <- TermDocumentMatrix(oz, control = list(tokenize = TrigramTokenizer))
inspect(tdm)
The problem is that DocumentTermMatrix() has an option, enabled by default, that lowercases your terms. Turn it off and you will preserve the case:
dtm <- DocumentTermMatrix(oz, control = list(tolower = FALSE))
colnames(dtm)[grep(".iami", colnames(dtm))]
## [1] "Miami" "Miami," "Miami." "Miami’s"
Here is another approach using the quanteda package, which may be more direct:
require(quanteda)
# straight from text to the matrix
dfmMatrix <- dfm(doctext, removeHyphens = TRUE, toLower = FALSE,
ignoredFeatures = stopwords("english"), verbose = FALSE)
# gets frequency counts, sorted in descending order of total term frequency
termfreqs <- topfeatures(dfmMatrix, n = nfeature(dfmMatrix))
# remove those with frequency < 2
termfreqs <- termfreqs[termfreqs >= 2]
head(termfreqs, 20)
## art I artists collecting work We collection collectors
## 35 29 19 17 15 14 13 12
## What contemporary The world us It Miami one
## 11 10 10 10 10 9 9 8
## always many make Art
## 8 8 8 7
We can see that the case of "Miami" (for instance) is preserved:
termfreqs[grep(".iami", names(termfreqs))]
## Miami Miami’s
## 9 2
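The same fix also covers the trigram part of the question: the `tolower` option can be passed in the same `control` list as the tokenizer. A minimal sketch, assuming the `oz` corpus and the RWeka tokenizer from the question are already defined:

```r
library(tm)
library(RWeka)

# Trigram tokenizer (min = max = 3), as in the question
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Pass tolower = FALSE alongside the tokenizer so the n-grams keep their case
tdm <- TermDocumentMatrix(oz,
                          control = list(tokenize = TrigramTokenizer,
                                         tolower  = FALSE))

# Trigrams containing "Miami" keep the capital M
rownames(tdm)[grep("Miami", rownames(tdm))]
```

With `tolower = FALSE` the terms of the matrix are the raw tokens, so any downstream frequency counting (e.g. `findFreqTerms(tdm, 2)`) reports "Miami" and "miami" as distinct terms.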