How to sample 75 percent of rows from a dtm?
How can I sample rows from a dtm? I have tried many variations of the code, but I keep getting the same error:
Error in dtm[splitter, ] : incorrect number of dimensions
This is the code:
n <- dtm$nrow
splitter <- sample(1:n, round(n * 0.75))
train_set <- dtm[splitter, ]
valid_set <- dtm[-splitter, ]
You can use the quanteda package for this. See the example below.
Create example data from the crude data set that ships with tm:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
crude <- tm_map(crude, stemDocument)
dtm <- DocumentTermMatrix(crude)
library(quanteda)
# Transform your dtm into a dfm for quanteda
my_dfm <- as.dfm(dtm)
# number of documents
ndoc(my_dfm)
[1] 20
set.seed(4242)
# create the training set
# (if your quanteda version reports `margin` as an unused argument, drop it)
train_set <- dfm_sample(my_dfm,
                        size = round(ndoc(my_dfm) * 0.75), # set sample size
                        margin = "documents")
# create the test set by selecting the documents that are not in the training set
test_set <- dfm_subset(my_dfm, !docnames(my_dfm) %in% docnames(train_set))
# number of documents in train
ndoc(train_set)
[1] 15
# number of documents in test
ndoc(test_set)
[1] 5
Afterwards you can use the quanteda function convert to turn your training and test sets into formats for use with topicmodels, lda, lsa, etc. See ?convert for details.
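As a rough illustration of the conversion step, here is a minimal sketch that converts a small dfm to a plain matrix. The toy texts and the names toy_dfm and train_matrix are made up for this example and stand in for train_set above; other targets such as "topicmodels" or "lda" are selected through the same to argument and may need additional packages installed.

```r
library(quanteda)

# Toy dfm standing in for train_set; the three texts are invented for illustration
toy_dfm <- dfm(tokens(c(d1 = "oil prices rise",
                        d2 = "crude oil falls",
                        d3 = "prices fall again")))

# Convert to a dense base R matrix (documents in rows, features in columns)
train_matrix <- convert(toy_dfm, to = "matrix")

dim(train_matrix)
```

The same call would be repeated for the test set so that both splits end up in the format the downstream model expects.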
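Try using the caret package:
library(caret)
#help(package="caret")
# `sample` and `news.raw` are the answerer's own objects and are not defined here;
# createDataPartition() expects the outcome vector as its first argument
index <- createDataPartition(sample, times = 1, p = 0.75, list = FALSE)
train <- news.raw[index, ]
test <- news.raw[-index, ]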
Hope this helps!
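On the error itself: R reports "incorrect number of dimensions" when an object is indexed with more subscripts than it has dimensions, so checking dim(dtm) before splitting is a reasonable first step. The sketch below shows the same 75/25 row split in base R with that check built in. The helper split_rows is hypothetical (it is not part of tm, quanteda, or caret), and it is demonstrated here only on a plain matrix, not on a tm object.

```r
# Hypothetical helper: split the rows of a two-dimensional object into two parts
split_rows <- function(x, p = 0.75) {
  # Fail early with a clear message if x does not have exactly two dimensions
  stopifnot(length(dim(x)) == 2)
  n <- nrow(x)
  idx <- sample(seq_len(n), round(n * p))
  # drop = FALSE keeps the result two-dimensional even if only one row is selected
  list(train = x[idx, , drop = FALSE],
       valid = x[-idx, , drop = FALSE])
}

set.seed(4242)
# Toy 20 x 2 matrix standing in for a document-term matrix
m <- matrix(seq_len(40), nrow = 20,
            dimnames = list(paste0("doc", 1:20), c("t1", "t2")))

parts <- split_rows(m)
nrow(parts$train)  # 15
nrow(parts$valid)  # 5
```

If the dimension check fails on your own object, the object is probably not the matrix you expect (for example a vector or a list), which would explain the error in the question.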