R: how to map test data into an LSA space created by training data
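I am trying to use LSA for text analysis. I have read many other Stack Overflow posts about LSA, but I have not found one similar to mine. If you know of one, please point me to it. Thanks!
Here is reproducible code for the sample data I created:
Create the sample training and test sets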
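library(data.table)  # data.table()
library(tm)          # Corpus(), TermDocumentMatrix(), stopwords(), weightTfIdf
library(SnowballC)   # wordStem()
library(lsa)         # lsa(), fold_in(), dimcalc_share()
library(caret)       # train(), trainControl()
sentiment = c(1,1,0,1,0,1,0,0,1,0)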
length(sentiment) #10
text = c('im happy', 'this is good', 'what a bummer X(', 'today is kinda okay day for me', 'i somehow messed up big time',
'guess not being promoted is not too bad :]', 'stayhing home is boring :(', 'kids wont stop crying QQ', 'warriors are legendary!', 'stop reading my tweets!!!')
train_data = data.table(sentiment = as.factor(sentiment), text = text)  # name the columns so train_data$sentiment works
> train_data
sentiment text
1: 1 im happy
2: 1 this is good
3: 0 what a bummer X(
4: 1 today is kinda okay day for me
5: 0 i somehow messed up big time
6: 1 guess not being promoted is not too bad :]
7: 0 stayhing home is boring :(
8: 0 kids wont stop crying QQ
9: 1 warriors are legendary!
10: 0 stop reading my tweets!!!
sentiment = c(0,1,0,0)
text = c('running out of things to say...', 'if you are still reading, good for you!', 'nothing ended on a good note today', 'seriously sleep deprived!! >__<')
test_data = data.table(sentiment = as.factor(sentiment), text = text)
> test_data
sentiment text
1: 0 running out of things to say...
2: 1 if you are still reading, good for you!
3: 0 nothing ended on a good note today
4: 0 seriously sleep deprived!! >__<
Preprocess the training set
corpus.train = Corpus(VectorSource(train_data$text))
Create the term-document matrix for the training set
tdm.train = TermDocumentMatrix(
corpus.train,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en"),
stemming = function(word) wordStem(word, language = "english"),
removeNumbers = TRUE,
tolower = TRUE,
weighting = weightTfIdf)
)
Convert to a matrix (for later use)
train_matrix = as.matrix(tdm.train)
Create an LSA space from the training data
lsa.train = lsa(tdm.train, dimcalc_share())
Set the number of dimensions (I picked one arbitrarily here because the data set is too small to produce an elbow)
k = 6
Project the training matrix into the new LSA space
projected.train = fold_in(docvecs = train_matrix, LSAspace = lsa.train)[1:k,]
Convert the projected data above into a matrix
projected.train.matrix = matrix(projected.train,
nrow = dim(projected.train)[1],
ncol = dim(projected.train)[2])
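Train a random forest model (for some reason this step no longer works on this small sample data, but that is fine, since it is not the main issue in this question. That said, if you can help me fix this error too, that would be great! I tried googling the error, but nothing fixed it...)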
trcontrol_rf = trainControl(method = "boot", p = .75, trim = TRUE)
model_train_caret = train(x = t(projected.train.matrix), y = train_data$sentiment, method = "rf", trControl = trcontrol_rf)
Preprocess the test set
Basically I repeat everything I did for the training set, except that I do not use the test set to create its own LSA space
corpus.test = Corpus(VectorSource(test_data$text))
Create the term-document matrix for the test set
tdm.test = TermDocumentMatrix(
corpus.test,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en"),
stemming = function(word) wordStem(word, language = "english"),
removeNumbers = TRUE,
tolower = TRUE,
weighting = weightTfIdf)
)
Convert to a matrix (for later use)
test_matrix = as.matrix(tdm.test)
Project the test matrix into the trained LSA space (this is where the problem is)
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)
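But I get an error:
Error in crossprod(docvecs, LSAspace$tk) : non-conformable arguments
I have not found any useful Google results for this error... (only one page of search results, QQ)
Any help is much appreciated! Thanks!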
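When you build the LSA model, you use the vocabulary of the training data. But when you build the TermDocumentMatrix for the test data, you use the vocabulary of the test data. The LSA model only knows how to handle documents expressed in terms of the training data's vocabulary.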
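To make the mismatch concrete, here is a minimal base-R sketch using toy matrices rather than the poster's data. The names tk, train_terms, test_terms and test_tdm are made up for illustration; tk only stands in for lsa.train$tk, assuming it has one row per training term. Because crossprod(x, y) computes t(x) %*% y, the two arguments must have the same number of rows, which fails as soon as the test vocabulary has a different size.

```r
# Toy stand-in for lsa.train$tk (assumption: one row per TRAINING term,
# one column per latent dimension).
train_terms <- c("good", "happi", "stop", "read")
tk <- matrix(seq_len(8) / 10, nrow = 4,
             dimnames = list(train_terms, c("d1", "d2")))

# Toy test term-document matrix built from its OWN vocabulary:
# 3 terms (rows) by 2 documents (columns).
test_terms <- c("good", "sleep", "read")
test_tdm <- matrix(c(1, 0, 1,  0, 1, 0), nrow = 3,
                   dimnames = list(test_terms, c("t1", "t2")))

# 3 rows vs. 4 rows: crossprod() cannot multiply these, so it raises an error.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(crossprod(test_tdm, tk),
                error = function(e) conditionMessage(e))
print(res)
```

Even if the two vocabularies happened to have the same number of terms, the rows would still refer to different words, so the projection would be meaningless without aligning them first.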
One way to fix this is to create the test TDM with dictionary set to the training data's vocabulary:
tdm.test = TermDocumentMatrix(
corpus.test,
control = list(
removeNumbers = TRUE,
tolower = TRUE,
stopwords = stopwords("en"),
stemming = TRUE,
removePunctuation = TRUE,
weighting = weightTfIdf,
dictionary=rownames(tdm.train)
)
)
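For intuition about what restricting the test TDM to the training vocabulary achieves, the following base-R sketch does the same alignment by hand on toy matrices. align_to_vocab is a hypothetical helper, not part of tm or lsa, and tk is again only a stand-in for lsa.train$tk. The final line mirrors just the crossprod(docvecs, LSAspace$tk) step named in the error message; it does not reproduce anything else fold_in does.

```r
# Hypothetical helper: reorder and pad the rows of a term-document matrix so
# they match a reference vocabulary exactly. Terms missing from `m` become
# all-zero rows; terms that are not in `vocab` are dropped.
align_to_vocab <- function(m, vocab) {
  out <- matrix(0, nrow = length(vocab), ncol = ncol(m),
                dimnames = list(vocab, colnames(m)))
  shared <- intersect(vocab, rownames(m))
  out[shared, ] <- m[shared, , drop = FALSE]
  out
}

# Toy stand-in for lsa.train$tk: training terms x latent dimensions.
train_terms <- c("good", "happi", "stop", "read")
tk <- matrix(seq_len(8) / 10, nrow = 4,
             dimnames = list(train_terms, c("d1", "d2")))

# Toy test term-document matrix with its own vocabulary.
test_terms <- c("good", "sleep", "read")
test_tdm <- matrix(c(1, 0, 1,  0, 1, 0), nrow = 3,
                   dimnames = list(test_terms, c("t1", "t2")))

aligned <- align_to_vocab(test_tdm, train_terms)  # now 4 terms x 2 documents
projected <- crossprod(aligned, tk)               # 2 documents x 2 dimensions
print(dim(aligned))
print(projected)
```

Note that the second toy document contains only the unseen term "sleep", so after alignment it is an all-zero column and projects to the origin. Test documents with no overlap with the training vocabulary carry no information in the LSA space.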