Converting a quanteda dfm to stm
I convert a tm corpus into a quanteda corpus, apply dfm(), and then convert the dfm to the stm format. This code was running fine 15 minutes ago; all I have done since is add more words to my custom remove list (myRMlist). I'm stumped. Any suggestions?
data(tmCorpus, package = "tm")
Qcorpus <- corpus(tmCorpus)
summary(Qcorpus, showmeta=TRUE)
myRMlist <- readLines("myremovelist2.txt", encoding = "UTF-8")
Qcorpus.dfm <- dfm(Qcorpus, remove = myRMlist )
Qcorpus.dfm <- dfm(Qcorpus.dfm, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove = stopwords("en"), stem = FALSE)
Qcorpus.dfm <- dfm(Qcorpus.dfm, remove = stopwords(("es")))
Qcorpus.stm <- convert(Qcorpus.dfm, to = "stm")
Error in convert(Qcorpus.dfm, to = "stm") : unused argument (to = "stm")
It's hard to reproduce your error because I don't have all of your inputs, but after recreating a custom set of words to remove, everything worked for me: the conversion succeeded.
There is, however, a better way to do what you are trying to do, which I set out below: first create a tokens object, remove your word list from it, then build the dfm, and finally convert that to the stm format.
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
# set up data
data(crude, package = "tm")
Qcorpus <- corpus(crude)
# simulate words to remove, not supplied
myRMlist <- readLines(textConnection(c("and", "or", "but", "of")))
# conversion works
stm_input_stm <- Qcorpus %>%
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
tokens_remove(pattern = c(myRMlist, stopwords("en"))) %>%
dfm() %>%
convert(to = "stm")
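For reference, the converted object is just a list whose documents, vocab, and meta elements line up with the corresponding arguments of stm::stm(). A minimal sketch of feeding it in (K = 5 is an arbitrary choice for illustration, and the object name mod_from_convert is my own):
# the converted object is a list with documents, vocab and meta components
str(stm_input_stm, max.level = 1)
library("stm")
# these components map directly onto stm()'s arguments
mod_from_convert <- stm(documents = stm_input_stm$documents,
                        vocab = stm_input_stm$vocab,
                        data = stm_input_stm$meta,
                        K = 5)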
But no conversion is needed at all, because stm::stm() can take a dfm directly as input:
# stm can take a dfm directly
stm_input_dfm <- Qcorpus %>%
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
tokens_remove(pattern = c(myRMlist, stopwords("en"))) %>%
dfm()
library("stm")
## stm v1.3.5 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
stm(stm_input_dfm, K = 5)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## .....
## Recovering initialization...
## .........
## Initialization complete.
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -6.022)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -5.480, relative change = 9.000e-02)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -5.386, relative change = 1.708e-02)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -5.370, relative change = 2.987e-03)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -5.367, relative change = 6.841e-04)
## Topic 1: said, mln, oil, last, billion
## Topic 2: oil, dlrs, said, crude, price
## Topic 3: oil, said, power, ship, crude
## Topic 4: oil, opec, said, prices, market
## Topic 5: oil, said, one, futures, mln
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -5.366, relative change = 1.601e-04)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -5.366, relative change = 5.444e-05)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -5.365, relative change = 1.856e-05)
## ....................
## Completed E-Step (0 seconds).
## Completed M-Step.
## Model Converged
## A topic model with 5 topics, 20 documents and a 971 word dictionary.
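In practice you would assign the fitted model to an object so you can inspect it afterwards; a minimal sketch (mod_dfm is just a name I chose, and n = 5 limits the number of words printed per topic):
# keep the fitted model so it can be inspected
mod_dfm <- stm(stm_input_dfm, K = 5)
# print the top words for each topic with stm's labelTopics()
labelTopics(mod_dfm, n = 5)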