How do I append to a Document Term Matrix in R?
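I would like to append two Document Term Matrices together. I have one row of data and want to apply different control functions to its parts (an n-gram tokenizer, stopword removal, and wordLengths bounds for the text, and none of these for my non-text fields).
When I use tm_combine, c(dtm_text, dtm_inputs), it adds the second set as a new row. I want to append these attributes to the same row instead.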
library("tm")
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
           use.names = FALSE)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
# NGram tokenize text data
dtm_text <- DocumentTermMatrix(Corpus(VectorSource(txt_fields)),
                               control = list(
                                 tokenize = BigramTokenizer,
                                 stopwords = TRUE,
                                 wordLengths = c(2, Inf),
                                 bounds = list(global = c(1, Inf))))
# Do not perform tokenization of other inputs
dtm_inputs <- DocumentTermMatrix(Corpus(VectorSource(other_inputs)),
                                 control = list(
                                   bounds = list(global = c(1, Inf))))
# DESIRED OUTPUT
<<DocumentTermMatrix (documents: 1, terms: 12)>>
Non-/sparse entries: 12/0
Sparsity : 0%
Maximal term length: 13
Weighting : term frequency (tf)
Terms
Docs am happy happy like like your love love your products products am store store love
1 1 1 1 1 1 1 1 1 1 1
Terms
Docs your products your store cd1_abc cd2_555 cd3_7654
1 1 1 1
1 1 1
dtm_combined = as.DocumentTermMatrix(cbind(dtm_text, dtm_inputs), weighting = weightTf)
inspect(dtm_combined)
# <<DocumentTermMatrix (documents: 1, terms: 8)>>
# Non-/sparse entries: 8/0
# Sparsity : 0%
# Maximal term length: 8
# Weighting : term frequency (tf)
#
# Terms
# Docs happy like love products store cd1_abc cd2_555 cd3_7654
# 1 1 1 1 1 1 1 1 1
But if the same word occurs in both dtm_text and dtm_inputs, this gives the wrong result: the word is not merged and appears twice in dtm_combined.
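The duplicate-term caveat can be reproduced, and worked around, on plain matrices. The following is a minimal sketch, not part of the original answer: the data and the names m_text, m_inputs, combined and merged are made up for illustration, and it assumes the matrices are small enough to hold densely (for a tm object this would mean calling as.matrix() on it first).

```r
# Two small term-count matrices that share the term "store" (made-up data)
m_text   <- matrix(c(1, 2), nrow = 1,
                   dimnames = list("1", c("like", "store")))
m_inputs <- matrix(c(3, 1), nrow = 1,
                   dimnames = list("1", c("store", "cd1_abc")))

# cbind() only places columns side by side, so "store" now appears twice
combined <- cbind(m_text, m_inputs)
colnames(combined)

# rowsum() sums rows by group, so transpose, group by term name,
# and transpose back to get one column per distinct term
merged <- t(rowsum(t(combined), group = colnames(combined)))
merged
```

After the merge, each term has a single column holding the sum of its counts from both inputs. Whether summing is the right way to reconcile the two sources depends on what the duplicated term means in each field.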
I would suggest using text2vec (but I'm biased, since I'm the author).
library(text2vec)
# Data to be tokenized
txt_fields <- paste("i like your store","i love your products","i am happy")
# Data not to be tokenized
other_inputs <- paste("cd1_ABC","cd2_555","cd3_7654")
stopwords = tm::stopwords("en")
# tokenize by whitespace
txt_tokens = strsplit(txt_fields, ' ', TRUE)
vocab = create_vocabulary(itoken(txt_tokens), ngram = c(1, 2), stopwords = stopwords)
# if you need word lengths:
# vocab$vocab = vocab$vocab[nchar(terms) > 1]
# but note, it will not remove "i_am", etc.
# you can add word "i" to stopwords to remove such terms
txt_vectorizer = vocab_vectorizer(vocab)
dtm_text = create_dtm(itoken(txt_tokens), vectorizer = txt_vectorizer)
# also tokenize by whitespace, but won't create bigrams in next step
other_inputs_tokens = strsplit(other_inputs, ' ', TRUE)
vocab_other = create_vocabulary(itoken(other_inputs_tokens))
other_vectorizer = vocab_vectorizer(vocab_other)
dtm_other = create_dtm(itoken(other_inputs_tokens), vectorizer = other_vectorizer)
# combine
result = cbind(dtm_text, dtm_other)
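The final combine step relies on cbind() working on sparse matrices from the Matrix package. The sketch below illustrates just that step without text2vec: dtm_a and dtm_b are hypothetical stand-ins built by hand with Matrix::sparseMatrix(), and the term names are made up.

```r
library(Matrix)

# Hand-built stand-ins for the two document-term matrices (one document each)
dtm_a <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(1, 1),
                      dims = c(1, 2),
                      dimnames = list("1", c("like", "like_store")))
dtm_b <- sparseMatrix(i = c(1, 1, 1), j = c(1, 2, 3), x = c(1, 1, 1),
                      dims = c(1, 3),
                      dimnames = list("1", c("cd1_ABC", "cd2_555", "cd3_7654")))

# cbind() requires the same number of rows (documents) on both sides
stopifnot(nrow(dtm_a) == nrow(dtm_b))

# Columns are appended to the same row and the result stays sparse
result <- cbind(dtm_a, dtm_b)
dim(result)
colnames(result)
```

As with the tm approach, columns are matched by position only, so a term present in both inputs would still end up in two separate columns.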