Stem completion in R replaces names, not data
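My team is using the quanteda package in R to do some topic modeling on medium-sized chunks of text (tens of thousands of words). I would like to reduce words to their stems before the topic modeling step, so that I do not count variations of the same word as different topics.
The only problem is that the stemming algorithm leaves behind some stems that are not real words: "happiness" is stemmed to "happi", "arrange" to "arrang", and so on. So, before visualizing the topic modeling results, I would like to restore the stems to complete words.
Reading through some earlier threads on Stack Overflow, I came across the function stemCompletion() from the tm package, which does this, at least approximately. It seems to work reasonably well.
But when I apply it to the vector of terms in the document-term matrix, stemCompletion() always replaces the names of the character vector rather than the character values themselves. Here is a reproducible example: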
# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)
# Get first 200 words of Mansfield Park
words <- head(mansfieldpark, 200)
# Build a corpus from words
corpus <- quanteda::corpus(words)
# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")
# Create a document text matrix and do topic modeling
dtm <- corpus %>%
  quanteda::dfm(remove_punct = TRUE,
                remove = STOPWORDS) %>%
  quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
  quanteda::convert("topicmodels")
# Word stems are now stored in dtm$dimnames$Terms
# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)
# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)
# Apply tm::stemCompletion to Terms
unstemmed_terms <-
  tm::stemCompletion(dtm$dimnames$Terms,
                     dictionary = words, # or corpus
                     type = "shortest")
# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)
tail(unstemmed_terms, 20)
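I am looking for a way to get the results returned by stemCompletion() into a character vector rather than into the names attribute of a character vector. Any insight into this issue is much appreciated.
The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm corpus object), but a set of lines from the Austen novel.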
tail(words)
# [1] "most liberal-minded sister and aunt in the world."
# [2] ""
# [3] "When the subject was brought forward again, her views were more fully"
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall"
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"
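To see why a dictionary of whole lines yields only NAs, here is a minimal base R sketch of prefix-based completion. It is a simplified stand-in and not tm's actual implementation: the helper `complete_shortest()` is hypothetical, the two dictionary lines are copied from the output above, and the regular-expression split is only a rough substitute for a real tokenizer.

```r
# Two of the lines shown above, used as a toy dictionary.
lines <- c(
  "When the subject was brought forward again, her views were more fully",
  "some surprise that it would be totally out of Mrs. Norris's power to"
)
stems <- c("view", "surpris", "total")

# Hypothetical helper: return the shortest dictionary entry that starts
# with the stem, or NA if no entry does.
complete_shortest <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.min(nchar(hits))]
}

# Dictionary of lines: no line *starts* with any of the stems, so all NA.
from_lines <- vapply(stems, complete_shortest, character(1), dictionary = lines)

# Dictionary of single words: each stem now finds a word it is a prefix of.
words_only <- unlist(strsplit(lines, "[^[:alpha:]'-]+"))
from_words <- vapply(stems, complete_shortest, character(1), dictionary = words_only)

from_lines
from_words
```

In both cases the result is a named character vector: the stems are the names and the completions (or NA) are the values, which is the same shape that stemCompletion() returns.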
But this can easily be tokenized with quanteda's tokens() and converted to a character vector.
unstemmed_terms <-
  tm::stemCompletion(dtm$dimnames$Terms,
                     dictionary = as.character(tokens(words, remove_punct = TRUE)),
                     type = "shortest")
tail(unstemmed_terms, 20)
# arrang chariti perhap parsonag convers happi
# "arranging" NA "perhaps" NA "conversation" "happily"
# belief most liberal-mind aunt again view
# "belief" "most" "liberal-minded" "aunt" "again" "views"
# explain calm inquiri where come heard
# "explained" "calm" NA NA "come" "heard"
# surpris total
# "surprise" "totally"
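If a plain character vector is what you need, one option is to drop the names and fall back to the original stem wherever no completion was found. The sketch below uses a small hand-written stand-in for the stemCompletion() result (a few values copied from the output above) so that it runs without tm. Treating empty strings as "no match" is an assumption, added in case a tm version returns "" rather than NA.

```r
# Stand-in for the named character vector returned by tm::stemCompletion():
# stems are the names, completions are the values.
completed <- c(arrang = "arranging", chariti = NA, perhap = "perhaps",
               parsonag = NA, happi = "happily")

# Treat NA (or an empty string) as "no completion found".
no_match <- is.na(completed) | !nzchar(completed)

# Keep the completion where there is one, otherwise keep the stem,
# then strip the names to get a plain character vector.
terms <- unname(ifelse(no_match, names(completed), completed))
terms
```

With the real object you would start from `unstemmed_terms` instead of `completed`. Note that two different stems can complete to the same word, so check for duplicates before writing the result back into dtm$dimnames$Terms.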